WO2019110120A1

WO2019110120A1 - Template matching function for bi-predictive mv refinement

Info

Publication number: WO2019110120A1
Application number: PCT/EP2017/082047
Authority: WO
Inventors: Semih Esenlik; Zhijie Zhao; Anand Meher KOTRA; Han GAO
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2017-12-08
Filing date: 2017-12-08
Publication date: 2019-06-13

Abstract

The present disclosure relates to motion vector determination based on template matching for bi-directional motion vector estimation. In particular, a block template is constructed as an average of blocks pointed to by initial motion vectors to be refined. Then, the motion vector refinement is performed by template matching in two different reference pictures. The matching is performed by finding an optimum (minimum or maximum, depending on the function) of a matching function corresponding to the best matching block in each of the two reference pictures. The optimum is searched for zero-mean template and a zero-mean candidate block (among the block positions pointed to by the motion vector candidates of the search space). In other words, before performing the function optimization, a mean of the template is subtracted from the template and a mean of the candidate block is subtracted from the candidate block. The predictor of the current block is then calculated as a weighted average of these best matching blocks in the respective reference pictures.

Description

Template matching function for bi-predictive MV refinement

The present invention relates to encoding and decoding of video and in particular to determination of motion vectors.

BACKGROUND

Current hybrid video codecs, such as H.264/AVC or H.265/HEVC, employ compression including predictive coding. A picture of a video sequence is subdivided into blocks of pixels and these blocks are then coded. Instead of coding a block pixel by pixel, the entire block is predicted using already encoded pixels in the spatial or temporal proximity of the block. The encoder further processes only the differences between the block and its prediction. The further processing typically includes a transformation of the block pixels into coefficients in a transformation domain. The coefficients may then be further compressed by means of quantization and further compacted by entropy coding to form a bitstream. The bitstream further includes any signaling information which enables the decoding of the encoded video. For instance, the signaling may include settings concerning the encoding such as size of the input picture, frame rate, quantization step indication, prediction applied to the blocks of the pictures, or the like. The coded signaling information and the coded signal are ordered within the bitstream in a manner known to both the encoder and the decoder. This enables the decoder parsing the coded signaling information and the coded signal.

Temporal prediction exploits temporal correlation between pictures, also referred to as frames, of a video. The temporal prediction is also called inter-prediction, as it is a prediction using the dependencies between (inter) different video frames. Accordingly, a block being encoded, also referred to as a current block, is predicted from one or more previously encoded picture(s) referred to as a reference picture(s). A reference picture is not necessarily a picture preceding the current picture in which the current block is located in the displaying order of the video sequence. The encoder may encode the pictures in a coding order different from the displaying order. As a prediction of the current block, a co-located block in a reference picture may be determined. The co-located block is a block which is located in the reference picture on the same position as is the current block in the current picture. Such prediction is accurate for motionless picture regions, i.e. picture regions without movement from one picture to another. In order to obtain a predictor which takes into account the movement, i.e. a motion compensated predictor, motion estimation is typically employed when determining the prediction of the current block. Accordingly, the current block is predicted by a block in the reference picture, which is located in a distance given by a motion vector from the position of the co-located block. In order to enable a decoder to determine the same prediction of the current block, the motion vector may be signaled in the bitstream. In order to further reduce the signaling overhead caused by signaling the motion vector for each of the blocks, the motion vector itself may be estimated. The motion vector estimation may be performed based on the motion vectors of the neighboring blocks in spatial and/or temporal domain.

The prediction of the current block may be computed using one reference picture or by weighting predictions obtained from two or more reference pictures. The reference picture may be an adjacent picture, i.e. a picture immediately preceding and/or the picture immediately following the current picture in the display order since adjacent pictures are most likely to be similar to the current picture. However, in general, the reference picture may be also any other picture preceding or following the current picture in the displaying order and preceding the current picture in the bitstream (decoding order). This may provide advantages for instance in case of occlusions and/or non-linear movement in the video content. The reference picture identification may thus be also signaled in the bitstream.

A special mode of the inter-prediction is a so-called bi-prediction in which two reference pictures are used in generating the prediction of the current block. In particular, two predictions determined in the respective two reference pictures are combined into a prediction signal of the current block. The bi-prediction may result in a more accurate prediction of the current block than the uni-prediction, i.e. prediction only using a single reference picture. The more accurate prediction leads to smaller differences between the pixels of the current block and the prediction (referred to also as“residuals”), which may be encoded more efficiently, i.e. compressed to a shorter bitstream. In general, more than two reference pictures may be used to find respective more than two reference blocks to predict the current block, i.e. a multi-reference inter prediction can be applied. The term multi reference prediction thus includes bi-prediction as well as predictions using more than two reference pictures.

In order to provide more accurate motion estimation, the resolution of the reference picture may be enhanced by interpolating samples between pixels. Fractional pixel interpolation can be performed by weighted averaging of the closest pixels. In case of half-pixel resolution, for instance a bilinear interpolation is typically used. Other fractional pixels are calculated as an average of the closest pixels weighted by the inverse of the distance between the respective closest pixels to the pixel being predicted.

The motion vector estimation is a computationally complex task in which a similarity is calculated between the current block and the corresponding prediction blocks pointed to by candidate motion vectors in the reference picture. Typically, the search region includes M x M samples of the image and each of the sample position of the M x M candidate positions is tested. The test includes calculation of a similarity measure between the N x N reference block C and a block R, located at the tested candidate position of the search region. For its simplicity, the sum of absolute differences (SAD) is a measure frequently used for this purpose and given by:

In the above formula, x and y define the candidate position within the search region, while indices i and j denote samples within the reference block C and candidate block R. The candidate position is often referred to as block displacement or offset, which reflects the representation of the block matching as shifting of the reference block within the search region and calculating a similarity between the reference block C and the overlapped portion of the search region. In order to reduce the complexity, the number of candidate motion vectors is usually reduced by limiting the candidate motion vectors to a certain search space. The search space may be, for instance, defined by a number and/or positions of pixels surrounding the position in the reference picture corresponding to the position of the current block in the current image. After calculating SAD for all M x M candidate positions x and y, the best matching block R is the block on the position resulting in the lowest SAD, corresponding to the largest similarity with reference block C. On the other hand, the candidate motion vectors may be defined by a list of candidate motion vectors formed by motion vectors of neighboring blocks.

Motion vectors are usually at least partially determined at the encoder side and signaled to the decoder within the coded bitstream. However, the motion vectors may also be derived at the decoder. In such case, the current block is not available at the decoder and cannot be used for calculating the similarity to the blocks to which the candidate motion vectors point in the reference picture. Therefore, instead of the current block, a template is used which is constructed out of pixels of already decoded blocks. For instance, already decoded pixels adjacent to the current block may be used. Such motion estimation provides an advantage of reducing the signaling: the motion vector is derived in the same way at both the encoder and the decoder and thus, no signaling is needed. On the other hand, the accuracy of such motion estimation may be lower.

In order to provide a tradeoff between the accuracy and signaling overhead, the motion vector estimation may be divided into two steps: motion vector derivation and motion vector refinement. For instance, a motion vector derivation may include selection of a motion vector from the list of candidates. Such a selected motion vector may be further refined for instance by a search within a search space. The search in the search space is based on calculating cost function for each candidate motion vector, i.e. for each candidate position of block to which the candidate motion vector points.

Document JVET-D0029: Decoder-Side Motion Vector Refinement Based on Bilateral Template Matching, X. Chen, J. An, J. Zheng (the document can be found at: http://phenix.it- sudparis.eu/jvet/ site) shows motion vector refinement in which a first motion vector in integer pixel resolution is found and further refined by a search with a half-pixel resolution in a search space around the first motion vector. Bi-directional motion vector search based on a block template is employed.

Motion vector estimation is a key feature of modern video coders and decoders since its efficiency in terms of quality, rate and complexity has impact on the efficiency of the video coding and decoding.

SUMMARY OF INVENTION

The problem underlying the present disclosure is to provide an improved motion vector determination.

This is achieved by the features of independent claims.

Some further exemplary embodiments are subject matter of the dependent claims.

According to the present disclosure, template matching function for finding among candidate blocks the best matching block operates on a template with removed mean and a candidate block with removed mean.

According to an aspect of the disclosure, an apparatus is provided for determining a predictor of a current image block, the apparatus comprising a processing circuitry configured to: determine a template for the current block based on blocks pointed to by initial motion vectors pointing to two different respective reference pictures; find a best matching motion vector among candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples; and derive the predictor for the current block based on block samples of a best-matching block pointed to by the best matching motion vector.

For instance, the predictor for the current block is derived without applying to the block samples a local illumination compensation.

For example, the processing circuitry is configured to: determine the template as a weighted average of samples pointed to by two candidate motion vectors in two respective different reference pictures; perform motion vector refinement of the initial motion vectors using the template; and derive the predictor from the block samples pointed to by the refined best matching motion vector.

In one implementation the function is sum of absolute differences.

According to an exemplary embodiment, the processing circuitry is configured to determine: the estimate of mean of the template by averaging a subset of the samples included in the template; and/or the estimate of mean of the candidate block samples by averaging a subset of said candidate block samples.

In addition, or alternatively, the processing circuitry is configured to determine the estimate of mean of the candidate block samples by a function of samples different from the candidate block samples, but located in the same reference picture as the candidate block samples.

According to an embodiment, the processing circuitry is further configured to: find a first best matching motion vector among a first set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples; find a second best matching motion vector among a second set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples; and derive the predictor for the current block based on the block samples of a first best-matching block pointed to by the first best matching motion vector and a second best-matching block pointed to by the second best matching motion vector without applying to the block samples a local illumination compensation. According to another aspect of the disclosure, an apparatus for encoding a current image block is provided, the apparatus comprising: the apparatus for determining a predictor of the current image block according to any of the above mentioned embodiments and examples; an encoding controller configured to select whether the apparatus for determining the predictor is to apply: (i) either the function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, or (ii) the same function of the template and the candidate block samples pointed to by a candidate motion vector; and a bitstream generator configured to generate a bitstream and to include therein as a syntax element an indicator indicating the result of the selection.

According to another aspect of the disclosure, an apparatus is provided for decoding a current image block from a bitstream comprising: the apparatus for determining a predictor of the current image block according to any of the above mentioned embodiments and examples; a bitstream parser for parsing from the bitstream a syntax element indicating a selection of whether the apparatus for determining the predictor is to apply: (i) either the function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, or (ii) the same function of the template and the candidate block samples pointed to by a candidate motion vector; and a decoder controller configured to control the apparatus for determining the predictor according to the selection.

Another aspect of the present invention provides a method for determining a predictor of a current image block, the method comprising the steps of: determining a template for the current block based on blocks pointed to by initial motion vectors pointing to two different respective reference pictures; finding a best matching motion vector among candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples; deriving the predictor for the current block based on block samples of a best-matching block pointed to by the best matching motion vector without applying to the block samples a local illumination compensation.

The method may further comprise: determining the template as a weighted average of samples pointed to by two candidate motion vectors in two respective different reference pictures; refining the initial motion vectors using the template; deriving the predictor from the block samples pointed to by the refined best matching motion vectors. For instance, the function is sum of absolute differences.

In an exemplary embodiment, the method comprising the step of determining: the estimate of mean of the template by averaging a subset of the samples included in the template; and/or the estimate of mean of the candidate block samples by averaging a subset of said candidate block samples.

The method may further comprise determining the estimate of mean of the candidate block samples by a function of samples different from the candidate block samples, but located in the same reference picture as the candidate block samples.

According to an embodiment, the method comprises the steps of: finding a first best matching motion vector among a first set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples; finding a second best matching motion vector among a second set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples; and deriving the predictor for the current block based on the block samples of a first best-matching block pointed to by the first best matching motion vector and a second best-matching block pointed to by the second best matching motion vector without applying to the block samples a local illumination compensation.

According to an aspect of the disclosure, a method is provided for encoding a current image block, the method comprising: determining a predictor of the current image block according to any of the above described methods; selecting whether the apparatus for determining the predictor is to apply: (i) either the function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, or (ii) the same function of the template and the candidate block samples pointed to by a candidate motion vector; and generating a bitstream and to include therein as a syntax element an indicator indicating the result of the selection.

According to an aspect of the disclosure, a method for decoding a current image block from a bitstream comprising: determining a predictor of the current image block according to any of the above mentioned methods; parsing from the bitstream a syntax element indicating a selection of whether the apparatus for determining the predictor is to apply: either the function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, or the same function of the template and the candidate block samples pointed to by a candidate motion vector; and controlling the apparatus for determining the predictor according to the selection.

Moreover, according to another aspect of the invention, a non-transitory computer-readable medium is provided storing instructions which when executed on a processor perform the steps of a method as described above.

BRIEF DESCRIPTION OF DRAWINGS

In the following, exemplary embodiments are described in more detail with reference to the attached figures and drawings, in which:

Figure 1 is a block diagram showing an exemplary structure of an encoder for encoding video signals;

Figure 2 is a block diagram showing an exemplary structure of a decoder for decoding video signals;

Figure 3 is a schematic drawing illustrating an exemplary template matching suitable for bi-prediction;

Figure 4 is a schematic drawing illustrating an exemplary template matching suitable for uni- and bi-prediction;

Figure 5 is a flow diagram illustrating a possible implementation of motion vector search; Figure 6 is a schematic drawing illustrating local illumination compensation applied in video coding.

Figure 7 is a schematic drawing illustrating decoder-side motion vector refinement. Figure 8 is a flow diagram showing a possible implementation of a combination of frame rate up-conversion and local illumination compensation.

Figure 9 is a flow diagram showing a possible implementation of motion vector search. Figure 10 is a flow diagram a possible implementation of motion vector search. DETAILED DESCRIPTION

The present disclosure relates to improvement of template matching applied in motion vector refinement. In particular, the matching function is applied to zero-mean template and zero- mean candidate block even if the mean of the found best matching blocks is not further adjusted (by local illumination control).

The template matching is used to find the best first and second motion vectors which point to a first reference picture and a second reference picture, respectively. This is performed for each reference picture by template matching in a predetermined search space on a location given by an initial motion vector which may be derived or signaled to the decoder.

The template matching may be performed based on a block template derived from the blocks pointed to by the initial motion vectors.

Such template matching for finding the best matching blocks to obtain predictor for the current block may be employed, for instance, in hybrid video encoder and/or decoder. For example, application to an encoder and / or decoder such as HEVC or similar may be advantageous. In particular further developments of HEVC or new codecs / standards may make use of the present disclosure.

Fig. 1 shows an encoder 100 which comprises an input for receiving input image samples of frames or pictures of a video stream and an output for generating an encoded video bitstream. The term“frame” in this disclosure is used as a synonym for picture. However, it is noted that the present disclosure is also applicable to fields in case interlacing is applied. In general, a picture includes m times n pixels. This corresponds to image samples and may comprise one or more color components. For the sake of simplicity, the following description refers to pixels meaning samples of luminance. However, it is noted that the motion vector search of the invention can be applied to any color component including chrominance or components of a search space such as RGB or the like. On the other hand, it may be beneficial to only perform motion vector estimation for one component and to apply the determined motion vector to more (or all) components.

The input blocks to be coded do not necessarily have the same size. One picture may include blocks of different sizes and the block raster of different pictures may also differ.

In an explicative realization, the encoder 100 is configured to apply prediction, transformation, quantization, and entropy coding to the video stream. The transformation, quantization, and entropy coding are carried out respectively by a transform unit 106, a quantization unit 108 and an entropy encoding unit 170 so as to generate as an output the encoded video bitstream.

The video stream may include a plurality of frames, wherein each frame is divided into blocks of a certain size that are either intra or inter coded. The blocks of for example the first frame of the video stream are intra coded by means of an intra prediction unit 154. An intra frame is coded using only the information within the same frame, so that it can be independently decoded and it can provide an entry point in the bitstream for random access. Blocks of other frames of the video stream may be inter coded by means of an inter prediction unit 144: information from previously coded frames (reference frames) is used to reduce the temporal redundancy, so that each block of an inter-coded frame is predicted from a block in a reference frame. A mode selection unit 160 is configured to select whether a block of a frame is to be processed by the intra prediction unit 154 or the inter prediction unit 144. This mode selection unit 160 also controls the parameters of intra or inter prediction. In order to enable refreshing of the image information, intra-coded blocks may be provided within inter-coded frames. Moreover, intra-frames which contain only intra-coded blocks may be regularly inserted into the video sequence in order to provide entry points for decoding, i.e. points where the decoder can start decoding without having information from the previously coded frames.

The intra estimation unit 152 and the intra prediction unit 154 are units which perform the intra prediction. In particular, the intra estimation unit 152 may derive the prediction mode based also on the knowledge of the original image while intra prediction unit 154 provides the corresponding predictor, i.e. samples predicted using the selected prediction mode, for the difference coding. For performing spatial or temporal prediction, the coded blocks may be further processed by an inverse quantization unit 1 10, and an inverse transform unit 1 12. After reconstruction of the block a loop filtering unit 120 is applied to further improve the quality of the decoded image. The filtered blocks then form the reference frames that are then stored in a decoded picture buffer 130. Such decoding loop (decoder) at the encoder side provides the advantage of producing reference frames which are the same as the reference pictures reconstructed at the decoder side. Accordingly, the encoder and decoder side operate in a corresponding manner. The term“reconstruction” here refers to obtaining the reconstructed block by adding to the decoded residual block the prediction block.

The inter estimation unit 142 receives as an input a block of a current frame or picture to be inter coded and one or several reference frames from the decoded picture buffer 130. Motion estimation is performed by the inter estimation unit 142 whereas motion compensation is applied by the inter prediction unit 144. The motion estimation is used to obtain a motion vector and a reference frame based on certain cost function, for instance using also the original image to be coded. For example, the motion estimation unit 142 may provide initial motion vector estimation. The initial motion vector may then be signaled within the bitstream in form of the vector directly or as an index referring to a motion vector candidate within a list of candidates constructed based on a predetermined rule in the same way at the encoder and the decoder. The motion compensation then derives a predictor of the current block as a translation of a block co-located with the current block in the reference frame to the reference block in the reference frame, i.e. by a motion vector. The inter prediction unit 144 outputs the prediction block for the current block, wherein said prediction block minimizes the cost function. For instance, the cost function may be a difference between the current block to be coded and its prediction block, i.e. the cost function minimizes the residual block. The minimization of the residual block is based e.g. on calculating a sum of absolute differences (SAD) between all pixels (samples) of the current block and the candidate block in the candidate reference picture. However, in general, any other similarity metric may be employed, such as mean square error (MSE) or structural similarity metric (SSIM).

However, cost-function may also be the number of bits necessary to code such inter-block and/or distortion resulting from such coding. Thus, the rate-distortion optimization procedure may be used to decide on the motion vector selection and/or in general on the encoding parameters such as whether to use inter or intra prediction for a block and with which settings.

The intra estimation unit 152 and inter prediction unit 154 receive as an input a block of a current frame or picture to be intra coded and one or several reference samples from an already reconstructed area of the current frame. The intra prediction then describes pixels of a current block of the current frame in terms of a function of reference samples of the current frame. The intra prediction unit 154 outputs a prediction block for the current block, wherein said prediction block advantageously minimizes the difference between the current block to be coded and its prediction block, i.e., it minimizes the residual block. The minimization of the residual block can be based e.g. on a rate-distortion optimization procedure. In particular, the prediction block is obtained as a directional interpolation of the reference samples. The direction may be determined by the rate-distortion optimization and/or by calculating a similarity measure as mentioned above in connection with inter-prediction.

The inter estimation unit 142 receives as an input a block or a more universal-formed image sample of a current frame or picture to be inter coded and two or more already decoded pictures 231. The inter prediction then describes a current image sample of the current frame in terms of motion vectors to reference image samples of the reference pictures. The inter prediction unit 142 outputs one or more motion vectors for the current image sample, wherein said reference image samples pointed to by the motion vectors advantageously minimize the difference between the current image sample to be coded and its reference image samples, i.e., it minimizes the residual image sample. The predictor for the current block is then provided by the inter prediction unit 144 for the difference coding.

The difference between the current block and its prediction, i.e. the residual block 105, is then transformed by the transform unit 106. The transform coefficients 107 are quantized by the quantization unit 108 and entropy coded by the entropy encoding unit 170. The thus generated encoded picture data 171 , i.e. encoded video bitstream, comprises intra coded blocks and inter coded blocks and the corresponding signaling (such as the mode indication, indication of the motion vector, and/or intra-prediction direction). The transform unit 106 may apply a linear transformation such as a Fourier or Discrete Cosine Transformation (DFT/FFT or DCT). Such transformation into the spatial frequency domain provides the advantage that the resulting coefficients 107 have typically higher values in the lower frequencies. Thus, after an effective coefficient scanning (such as zig-zag), and quantization, the resulting sequence of values has typically some larger values at the beginning and ends with a run of zeros. This enables further efficient coding. Quantization unit 108 performs the actual lossy compression by reducing the resolution of the coefficient values. The entropy coding unit 170 then assigns to coefficient values binary codewords to produce a bitstream. The entropy coding unit 170 also codes the signaling information (not shown in Fig. 1 ).

Fig. 2 shows a video decoder 200. The video decoder 200 comprises particularly a decoded picture buffer 230, an inter prediction unit 244 and an intra prediction unit 254, which is a block prediction unit. The decoded picture buffer 230 is configured to store at least one (for uni-prediction) or at least two (for bi-prediction) reference frames reconstructed from the encoded video bitstream, said reference frames being different from a current frame (currently decoded frame) of the encoded video bitstream. The intra prediction unit 254 is configured to generate a prediction block, which is an estimate of the block to be decoded. The intra prediction unit 254 is configured to generate this prediction based on reference samples that are obtained from the decoded picture buffer 230.

The decoder 200 is configured to decode the encoded video bitstream generated by the video encoder 100, and preferably both the decoder 200 and the encoder 100 generate identical predictions for the respective block to be encoded / decoded. The features of the decoded picture buffer 230 and the intra prediction unit 254 are similar to the features of the decoded picture buffer 130 and the intra prediction unit 154 of Fig. 1. The video decoder 200 comprises further units that are also present in the video encoder 100 like e.g. an inverse quantization unit 210, an inverse transform unit 212, and a loop filtering unit 220, which respectively correspond to the inverse quantization unit 1 10, the inverse transform unit 1 12, and the loop filtering unit 120 of the video coder 100.

An entropy decoding unit 204 is configured to decode the received encoded video bitstream and to correspondingly obtain quantized residual transform coefficients 209 and signaling information. The quantized residual transform coefficients 209 are fed to the inverse quantization unit 210 and an inverse transform unit 212 to generate a residual block. The residual block is added to a prediction block 265 and the addition is fed to the loop filtering unit 220 to obtain the decoded video. Frames of the decoded video can be stored in the decoded picture buffer 230 and serve as a decoded picture 231 for inter prediction.

Generally, the intra prediction units 154 and 254 of Figs. 1 and 2 can use reference samples from an already encoded area to generate prediction signals for blocks that need to be encoded or need to be decoded.

The entropy decoding unit 204 receives as its input the encoded bitstream 171 . In general, the bitstream is at first parsed, i.e. the signaling parameters and the residuals are extracted from the bitstream. Typically, the syntax and semantic of the bitstream is defined by a standard so that the encoders and decoders may work in an interoperable manner. As described in the above Background section, the encoded bitstream does not only include the prediction residuals. In case of motion compensated prediction, a motion vector indication is also coded in the bitstream and parsed therefrom at the decoder. The motion vector indication may be given by means of a reference picture in which the motion vector is provided and by means of the motion vector coordinates. So far, coding the complete motion vectors was considered. However, also only the difference between the current motion vector and the previous motion vector in the bitstream may be encoded. This approach allows exploiting the redundancy between motion vectors of neighboring blocks.

In order to efficiently code the reference picture, H.265 codec (ITU-T, H265, Series H: Audiovisual and multimedia systems: High Efficient Video Coding) provides a list of reference pictures assigning to list indices respective reference frames. The reference frame is then signaled in the bitstream by including therein the corresponding assigned list index. Such list may be defined in the standard or signaled at the beginning of the video or a set of a number of frames. It is noted that in H.265 there are two lists of reference pictures defined, called LO and L1. The reference picture is then signaled in the bitstream by indicating the list (LO or L1 ) and indicating an index in that list associated with the desired reference picture. Providing two or more lists may have advantages for better compression. For instance, LO may be used for both uni-directionally inter-predicted slices and bi-directionally inter-predicted slices while L1 may only be used for bi-directionally inter-predicted slices. However, in general the present disclosure is not limited to any content of the L0 and L1 lists.

The lists L0 and L1 may be defined in the standard and fixed. However, more flexibility in coding/decoding may be achieved by signaling them at the beginning of the video sequence. Accordingly, the encoder may configure the lists LO and L1 with particular reference pictures ordered according to the index. The LO and L1 lists may have the same fixed size. There may be more than two lists in general. The motion vector may be signaled directly by the coordinates in the reference picture. Alternatively, as also specified in H.265, a list of candidate motion vectors may be constructed and an index associated in the list with the particular motion vector can be transmitted.

Motion vectors of the current block are usually correlated with the motion vectors of neighboring blocks in the current picture or in the earlier coded pictures. This is because neighboring blocks are likely to correspond to the same moving object with similar motion and the motion of the object is not likely to change abruptly over time. Consequently, using the motion vectors in neighboring blocks as predictors reduces the size of the signaled motion vector difference. The Motion Vector Predictors (MVPs) are usually derived from already encoded/decoded motion vectors from spatial neighboring blocks or from temporally neighboring blocks in the co-located picture. In H.264/AVC, this is done by doing a component wise median of three spatially neighboring motion vectors. Using this approach, no signaling of the predictor is required. Temporal MVPs from a co-located picture are only considered in the so called temporal direct mode of H.264/AVC. The H.264/AVC direct modes are also used to derive other motion data than the motion vectors. Hence, they relate more to the block merging concept in HEVC. In HEVC, the approach of implicitly deriving the MVP was replaced by a technique known as motion vector competition, which explicitly signals which MVP from a list of MVPs, is used for motion vector derivation. The variable coding quad-tree block structure in HEVC can result in one block having several neighboring blocks with motion vectors as potential MVP candidates. Taking the left neighbor as an example, in the worst case a 64x64 luma prediction block could have 16 4x4 luma prediction blocks to the left when a 64x64 luma coding tree block is not further split and the left one is split to the maximum depth.

Advanced Motion Vector Prediction (AMVP) was introduced to modify motion vector competition to account for such a flexible block structure. During the development of HEVC, the initial AMVP design was significantly simplified to provide a good trade-off between coding efficiency and an implementation friendly design. The initial design of AMVP included five MVPs from three different classes of predictors: three motion vectors from spatial neighbors, the median of the three spatial predictors and a scaled motion vector from a co located, temporally neighboring block. Furthermore, the list of predictors was modified by reordering to place the most probable motion predictor in the first position and by removing redundant candidates to assure minimal signaling overhead. The final design of the AMVP candidate list construction includes the following two MVP candidates: a) up to two spatial candidate MVPs that are derived from five spatial neighboring blocks; b) one temporal candidate MVPs derived from two temporal, co-located blocks when both spatial candidate MVPs are not available or they are identical; and c) zero motion vectors when the spatial, the temporal or both candidates are not available. Details on motion vector determination can be found in the book by V. Sze et al (Ed.), High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014, in particular in Chapter 5, incorporated herein by reference.

In order to further improve motion vector estimation without further increase in signaling overhead, it may be beneficial to further refine the motion vectors derived at the encoder side and provided in the bitstream. The motion vector refinement may be performed at the decoder without assistance from the encoder. The encoder in its decoder loop may employ the same refinement to obtain corresponding motion vectors. Motion vector refinement is performed in a search space which includes integer pixel positions and fractional pixel positions of a reference picture. For example, the fractional pixel positions may be half-pixel positions or quarter-pixel or further fractional positions. The fractional pixel positions may be obtained from the integer (full-pixel) positions by interpolation such as bi-linear interpolation.

In a bi-prediction of current block, two prediction blocks obtained using the respective first motion vector of list LO and the second motion vector of list L1 , are combined to a single prediction signal, which can provide a better adaptation to the original signal than uni prediction, resulting in less residual information and possibly a more efficient compression.

Since at the decoder, the current block is not available since it is being decoded, for the purpose of motion vector refinement, a template is used, which is an estimate of the current block and which is constructed based on the already processed (i.e. coded at the encoder side and decoded at the decoder side) image portions.

First, an estimate of the first motion vector MVO and an estimate of the second motion vector MV1 are received as input at the decoder 200. At the encoder side 100, the motion vector estimates MVO and MV1 may be obtained by block matching and/or by search in a list of candidates (such as merge list) formed by motion vectors of the blocks neighboring to the current block (in the same picture or in adjacent pictures). MVO and MV1 are then advantageously signaled to the decoder side within the bitstream. However, it is noted that in general, also the first determination stage at the encoder could be performed by template matching which would provide the advantage of reducing signaling overhead.

At the decoder side 200, the motion vectors MVO and MV1 are advantageously obtained based on information in the bitstream. The MVO and MV1 are either directly signaled, or differentially signaled, and/or an index in the list of motion vector (merge list) is signaled. However, the present disclosure is not limited to signaling motion vectors in the bitstream. Rather, the motion vector may be determined by template matching already in the first stage, correspondingly to the operation of the encoder. The template matching of the first stage (motion vector derivation) may be performed based on a search space different from the search space of the second, refinement stage. In particular, the refinement may be performed on a search space with higher resolution (i.e. shorter distance between the search positions).

An indication of the two reference pictures RefPicO and RefPid , to which respective MVO and MV1 point, are provided to the decoder as well. The reference pictures are stored in the decoded picture buffer at the encoder and decoder side as a result of previous processing, i.e. respective encoding and decoding. One of these reference pictures is selected for motion vector refinement by search. A reference picture selection unit of the apparatus for the determination of motion vectors is configured to select the first reference picture to which MVO points and the second reference picture to which MV1 points. Following the selection, the reference picture selection unit determines whether the first reference picture or the second reference picture is used for performing of motion vector refinement. For performing motion vector refinement, the search region in the first reference picture is defined around the candidate position to which motion vector MVO points. The candidate search space positions within the search region are analyzed to find a block most similar to a template block by performing template matching within the search space and determining a similarity metric such as the sum of absolute differences (SAD). The positions of the search space denote the positions on which the top left corner of the template is matched. As already mentioned above, the top left corner is a mere convention and any point of the search space such as the central point can in general be used to denote the matching position.

Figure 4 illustrates an alternative template matching which is also applicable for uni prediction. Details can be found in document JVET-A1001 , an in particular in Section„2.4.6. Pattern matched motion vector derivation" of document JVET-A1001 which is titled “Algorithm Description of Joint Exploration Test Model 1”, by Jianle Chen et. al. and which is accessible at: http://phenix.it-sudparis.eu/jvet/. The template in this template matching approach is determined as samples adjacent to the current bock in the current frame. As shown in Figure 1 , the already reconstructed samples adjacent to the top and left boundary of the current block may be taken, referred to as“L-shaped template”.

According to the above mentioned document JVET-D0029, the decoder-side motion vector refinement (DMVR) has as an input the initial motion vectors MVO and MV1 which point into two respective reference pictures RefPictO and RefPictl . These initial motion vectors are used for determining the respective search spaces in the RefPictO and RefPictl . Moreover, using the motion vectors MVO and MV1 , a template is constructed based on the respective blocks (of samples) A and B pointed to by MVO and MV1 as follows:

Template = function (Block A, Block B).

The function may be sample clipping operation in combination with sample-wise weighted summation. The template is then used to perform template matching in the search spaces determined based on MVO and MV1 in the respective reference pictures 0 and 1 . The cost function for determining the best template match in the respective search spaces is SAD(Template, Block candA), where block candA is the candidate coding block which is pointed by the candidate MV in the search space spanned on a position given by the MVO. Figure 3 illustrates the determination of the best matching block A and the resulting refined motion vector MVO’. Correspondingly, the same template is used to find best matching block B’ and the corresponding motion vector MVT which points to block B’ as shown in Figure 3. In other words, after the template is constructed based on the block A and B pointed to by the initial motion vectors MVO and MV1 , the refined motion vectors MVO’ and MV1’ are found via search on RefPicO and RefPid with the template.

Motion vector derivation techniques are sometimes also referred to as frame rate up- conversion (FRUC). The initial motion vectors MVO and MV1 may generally be indicated in the bitstream to ensure that encoder and decoder may use the same initial point for motion vector refinement. Alternatively, the initial motion vectors may be obtained by providing a list of initial candidates including one or more initial candidates. For each of them a refined motion vector is determined and at the end, the refined motion vector minimizing the cost function is selected.

As mentioned above, template-matched motion vector derivation mode is a special merge mode based on Frame- FRUC) techniques. With this mode, motion information of the block is derived at decoder side. According to the specific implementation that is described in the document JVET-A1001 (“Algorithm Description of Joint Exploration Test Model 1”, which is accessible under http://phenix.it-sudparis.eu/jvet/), FRUC flag is signaled for a CU or PU when the merge flag is true. When the FRUC flag is false, a merge index is signaled and the regular merge mode is used. When the FRUC flag is true, an additional FRUC mode flag is signalled to indicate which method (bilateral matching or template matching) is to be used to derive motion information for the block.

In summary, during the motion derivation process, an initial motion vector is first derived for the whole Prediction Unit (PU) based on bilateral matching or template matching. First, the merge list of the PU is checked and the candidate which leads to the minimum matching cost is selected as the starting point (initial motion vector). Then a local search based on bilateral matching or template matching around the starting point is performed and the Motion Vector(s) (MV) that result in the minimum matching cost is taken as the MV for the PU. Then the motion information is further refined level with the derived PU motion vectors as the starting points. The terms prediction unit (PU) and coding unit (CU) can be used interchangeably here to describe a block of samples within a picture(frame).

As shown in the Figure 4B, the bilateral matching (that is described in the document JVET- A1001 ) is used to derive motion information of the current CU by finding the closest match between two blocks along the motion trajectory of the current CU in two different reference pictures. Under the assumption of continuous motion trajectory, the motion vectors MVO and MV1 pointing to the two reference blocks shall be proportional to the temporal distances, i.e., TDO and TD1 , between the current picture and the two reference pictures. Accordingly, in one embodiment, in each tested candidate pair of vectors, the two respective vectors are on a straight line in the image plane. As a special case, when the current picture is temporally between the two reference pictures and the temporal distance from the current picture to the two reference pictures is the same, the bilateral matching becomes mirror based bi directional MV.

As shown in Figure 4A, template matching (that is described in the document JVET-A1001 ) is used to derive motion information of the current CU by finding the closest match between a template (top and/or left neighbouring blocks of the current CU) in the current picture and a block (same size to the template) in a reference picture. The“Pattern matched motion vector derivation” section of the document JVET-A1001 describe a specific implementation of the template matching and bilateral matching methods. As an example it is mentioned that the bilateral matching operation is applied only if “merge flag” is true, indicating that“block merging” operation mode is selected. Here the authors of the document JVET-A1001 are referring to the“merge mode” of the H.265 standard. It is noted that template matching and bilateral matching methods described in JVET-A1001 can also be applied to other video coding standards, resulting in variations in the specific implementation. Figure 5 is a flow diagram which visualized DMVR operating at a decoder side. According to document JVET-D0029, the DMVR is applied under two conditions: 1 ) the prediction type is set to skip mode or merge mode, 2) the prediction mode is bi-prediction. First initial motion vectors MVO (of reference list 0) and MV1 (of reference list 1 ) are derived. The derivation process is performed according to the respective skip and merge operation. Here the authors of the document JVET-D0029 refer to the skip mode and merge mode of the H.265 standard. Description of these modes can be found in Section 5.2.2.3“Merge Motion Data Signaling and Skip Mode” of the book by v. Sze, M. Budagavi and G.J. Sullivan (Ed.), High Efficiency Video Coding (HEVC), Algorithms and Architectures, 2014. In H.265 if the skip mode is used to indicate for a block that the motion data is inferred instead of explicitly signaled and that the prediction residual is zero, i.e. no transform coefficients are transmitted. If the merge mode is selected, the motion data is also inferred, but the prediction residual is not zero, i.e. transform coefficients are explicitly signaled.

The parse index is parsed 510 from the input video stream. The parsed index points to the best motion vector candidate of a MV candidate list, which is constructed 520. The best motion vector candidate is then selected 530 and template is obtained by weighted averaging 540. The DMVR 550 is applied as follows. A block template is calculated by adding together the blocks that are referred by MVO and MV1 as explained above with reference to Figure 3. Clipping is applied afterwards. The template is used to find a refined motion vector MVO’, around the initial motion vector MVO. Search region is integer pel- resolution (the points of the search space are spaced from each other by integer sample distance). Sum of Absolute Differences (SAD) cost measure is used to compare the template block and the new block pointed by MVO’. The template is used to find a refined MVO”, around the MVO’. Search region is half-pel resolution (the points of the search space are spaced from each other by half of the sample distance). Same cost measure is used. The latter two steps are repeated to find MV1”. The new bi-predicted block is formed by adding together the blocks pointed by MVO” and MV1”. The blocks block A’ and block B’ pointed to by such refined motion vectors MVO” and MV1” are then averaged (more specifically weighted averaging) 560 to obtain final prediction.

Figure 7 illustrates the performing a DMVR iteration on reference picture RefPictO. Current picture includes current block 710 for which motion vector should be found based on motion vector MVO in RefPictO. Search space including 5 integer positions is determined; the blocks pointed to by the candidate position are referred to as Ax. The output is the best matching of the blocks Ax pointed to by motion vector MVO’. Figure 6 illustrates another tool which may be employed in video coding and decoding. Local Illumination Compensation (LIC) is based on a linear model for illumination changes, using a scaling factor a and an offset b. LIC may be enabled or disabled adaptively for each inter mode coded coding unit (CU). When LIC applies for a CU, a least square error method may be employed to derive the parameters a and b by using the neighboring samples of the current CU and their corresponding reference samples. More specifically, as illustrated in Figure 6, the subsampled (2:1 subsampling) neighboring samples of the CU and the corresponding samples (identified by motion information of the current CU or sub-CU) in the reference picture are used. The LIC parameters are derived and applied for each prediction direction separately. Here, the subsampling 2:1 means that every second pixel on the current CU boundary and the reference block is taken. More details on the use of a scaling factor/ multiplicative weighting factor and the offset can be found in Section“2.4.4. Local illumination compensation” of JVET-A1001 document titled Algorithm Description of Joint Exploration Test Model 1 , by Jianle Chen et. al.

When LIC is enabled for a picture, additional CU level rate-distortion check may be performed to determine whether LIC is applied or not for a CU. When LIC is enabled for a CU, a mean-removed sum of absolute differences (MR-SAD) and a mean-removed sum of absolute Hadamard-transformed differences (MR-SATD) are used, instead of SAD and SATD, for integer pel motion search and fractional pel motion search, respectively.

Thus, depending on whether or not LIC is applied, the FRUC may be performed as shown in Figure 8. Figure 8 illustrates in step 810 performing of FRUC resulting in obtainment of initial motion vectors for bilateral prediction. If LIC is to be applied, in step 820,“yes”, then the matching function to find the best matching block is given by:

F1 = SAD (block A - mean(block A), block B - mean(block B)).

Block A represents one of the candidate blocks in reference picture A. During the search operation (bi-lateral matching), more than one block A is tested. Block B is the candidate block in reference picture B. According to the bilateral matching operation block B is pointed by a motion vector that is a mirrored and scaled version of motion vector that points to block A as was described with reference to Fig. 4B. In other words, in step 835, bi-lateral matching is applied to refine initial motion vectors and thereby to find the corresponding blocks A’ and B’. In Figure 8, block A’ and block B’ are the output of the bilateral matching operation, i.e. the best matching blocks according to the above matching function F1. The function mean(block A) describes the operation of averaging the intensity values of samples of the block A. The function SAD calculates the sum of the absolute differences of the values of samples of block A and the samples of block B. It can be rewritten as:

Where;

, where the w and h correspond to the width and height of the block in terms of samples, and the block_Ai represents the sample of the block A at coordinate (i,j).

In step 840 then LIC is applied to blocks A and B. Although LIC is applied, it does not necessarily change the pixel values of the block A or block B. In other words block A’ may be in some situations equal to block A” and block B’ to block B”. The blocks A” and B” output from the LIC 840 are then further averaged by weighted averaging step 855 to obtain prediction for the current block. The application of the LIC is given by the equation: block_A"i = block_A'i_j * a + b

A least square error method is employed to derive the parameters“a” and“b” by using the neighbouring samples of the current CU and their corresponding reference samples around block A’. The figure 1 1 of document JVET-A1001 describe how the neighboring and reference samples are selected. On general“a” is different than 1 and“b” is different than 0.

When LIC is not to be applied (“No” in step 820), the matching function to be used is given by:

F2 = SAD (block A, b!ockJB)

which can be rewritten as:

After finding the best matching blocks A’ and B’ in step 830, weighted averaging is applied to these best matching blocks in step 850 to obtain prediction of the current block. As can be seen from Figure 8 described above, the matching function F2, i.e. the SAD of block A and block B without mean removal, is applied if no LIC is applied. However, if LIC is applied, the matching function F1 is employed, i.e. the SAD is determined for the blocks block A and block B with the respective mean values removed (also denoted as mean- removed SAD, MRSAD). The right branch of the prediction method shown in Figure 8 is a combination of the different processes, in particular bi-lateral matching 835, and LIC 840.

This may not be always efficient. There is a certain synergy between the operation of bilateral matching 835 and the operation of LIC 840. In particular, LIC is used to predict the average illumination level, which corresponds to the mean value in the current block as illustrated in Figure 6. On the other hand, the bi-lateral matching is used to predict the texture details. Therefore, the matching function with the mean removed specifically targets the application of both, the LIC as well as the bi-lateral matching. In particular, the matching function F1 used in the bi-lateral matching step 835 subtracts the mean from the blocks because the mean will be later added by LIC in step 840. In step 835, the average illumination value of block A and block B does not have any effect on the bi-lateral matching, search operation, since the average value is to be changed in the following step. The LIC in step 840 modifies the value of the blocks A’ and B’.

In other words, it is shown in Figure 8 that MRSAD is applied based on the application of LIC, i.e. based on zero-mean block A and block B.

However, according to an embodiment of the present invention, MRSAD is applied not based on the application of LIC, i.e. not in combination with LIC. Figures 9 and 10 both show flow diagrams of motion vector search using DMVR. In both implementations, DMVR is applied. After it has been decided that DMVR is applied (step 905, conditions for DMVR application are mentioned in the description accompanying Figure 5), the remaining steps (index parsing 910, 1010, construction of a candidate list 920, 1020, selection of the best MV candidate 930, 1030, weighted averaging 940, 1040, motion vector refinement 950, 1050, and weighted averaging 960, 1060) are performed which correspond to the respective steps shown in Figure 5.

However, on the one hand, in motion vector refinement 950 of Figure 9, the SAD of a template and a reference block, block A from a reference picture, shown by a box between steps 940 and 950 is used as a cost function. On the other hand, in step 1050 of the motion vector refinement method shown in Figure 10, the cost function is changed with respect to Figure 9. In particular, in the motion vector refinement of Figure 10, the SAD is calculated from the template and the reference block, block A from which the respective mean values have been removed. I.e. an MRSAD function is calculated rather than a simple SAD. Accordingly, an apparatus is provided for determining a predictor of a current image block. The apparatus comprises processing circuitry configured to determine a template for a current block based on blocks pointed to by initial motion vectors, i.e. a first initial motion vector (MVO) and a second initial motion vector (MV1 ) which point to two different respective reference pictures, i.e. a first reference picture and a second reference picture (such as RefPicO and RefPid sown in Figure 7). The reference pictures may be two pictures in temporally different directions with respect to the current image, i.e. the image to which the current image block belongs. For example, the first initial motion vector points to a reference picture preceding the current image, and the second initial motion vector points to an image preceded by the current image in the order of recording (displaying).

The processing circuitry is further configured to find a best matching motion vector among candidate motion vectors. These candidate motion vectors form a motion vector search space in one of the respective reference pictures. The location of the respective search spaces in the two reference pictures RefPictO and RefPictl are determined by the initial motion vectors MVO and MV1 as shown in Figure 7. Thus, the initial motion vectors point to the positions of the search spaces in the reference picture. In the example of Figure 7, the MVO points to the center position of the search space formed by five positions: the center position and four directly adjacent positions on the top, bottom, to the left and right of the center position.

The best matching motion vector is determined by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples. The function F can be expressed by

F = f (T- mean(T), Ax - mean (Ax)),

In a different formulation, the equation can be written as follows

wherein the template (block) is denoted by T and the candidate block by Ax. The candidate block Ax is a block located on one of the positions in the respective search space (set of points pointed to by candidate motion vectors). Function F is a MRSAD function (thus, function f corresponds to SAD). Alternatively, function f may be any other similarity or dissimilarity measure such as sum of squared differences or the like. The processing circuitry is further adapted to derive the predictor for the current block based on block samples of a best-matching block pointed to by the best matching motion vector.

In some exemplary embodiments, the processing circuitry is adapted to derive the predictor for the current block without applying to the block samples a local illumination compensation (LIC).

It was shown in Figure 8 that the MRSAD function is applied when LIC (see also Figure 6) is performed. LIC requires that the MRSAD function is used in motion search. This means that whenever the LIC is signaled (or derived) to be applied to the current CU, MRSAD is used as the matching function. The MRSAD can be therefore thought of as an integral part of the LIC operation. However, in this embodiment of the invention, MRSAD is used for DMVR operation, but the LIC is not applied afterwards. Therefore it is a specific combination of the tools (a part of LIC (i.e. MRSAD) and DMVR).

In the apparatus according to this embodiment, motion vector refinement is performed by template matching. In particular, block-based motion vector refinement is applied. I.e., the template has the same shape and possibly also size as the current image block (e.g. a square of n c n pixels/samples). However, a subsampled block may be also used for template matching in order to further reduce the complexity.

In one exemplary implementation, the processing circuitry performs bi-prediction. This means that the template is determined as a weighted average of samples pointed to by two candidate motion vectors in two respective different reference pictures, corresponding to step 1060 shown in Figure 10. In particular, the template is obtained in step 1040 using a function of block A (determined by the first initial motion vector) and block B (determined by the second initial motion vector).

Moreover, the best-matching motion vector MVO’ and MVT is determined 1050 respectively in the first reference picture RefPictO and the second reference picture RefPictl by determining a best matching block for the template T. It is noted that the invention is applicable to the case where only MVO’ is obtained via template matching (therefore MVT is same as MV1 or MVT is obtained via a different process). The best matching block is a block in a reference picture having the greatest similarity given by function F (corresponding to lowest value in case of the SAD metric) among the positions of the search space defined in the reference picture by the initial motion vector. The predictor for the current block is derived from the block samples A’ and B’ pointed to by the refined best matching motion vectors MVO’ and MVT. The final predictor of the current block is then obtained by averaging (or applying a weighted average to) the best-matching blocks of the first reference picture and the second reference picture, respectively. The weighted averaging may use weights proportional to the distance between the current picture and the respective reference pictures RefPictO and RefPict 1 . Or the weighting factors for averaging might be signalled in the bitstream.

The blocks from which the template is determined, block A and block B belong to different reference pictures (RefPicO and RefPid ) which correspond to different time instances in the recording of the video. Due to this, the illumination level of the reference pictures might be different. In accordance with this embodiment, the blocks A and B are first normalized according to the average illumination level. Within the function F, the mean function is used to estimate the average illumination level. As a result the motion vector search operation is not affected by changes in the illumination level. As a result, this may increase the coding gain.

When looking at the above described Figures 1 and 2, the present disclosure may be implemented in the motion vector prediction of block 144 and 244. The processing circuitry described above may be a part of the encoder 100 or the decoder 200. Alternatively, the processing circuitry may also implement (embody) some or all of the units if the encoder 100 and/or decoder 200.

For instance, the function used for finding the best motion vector is the sum of absolute differences. Accordingly, as shown in Figure 10, in this case the function F can be expressed by

F= SAD (template - mean(template), block Ax - mean(block Ax)) for block Ax of the first reference picture (and analogously for block Bx of the second reference picture). The function is applied to all points x of a search space. When looking at Figure 7, five points x=1 , 2, 3, 4, 5 are tested by calculating SADs between the template T and respective blocks A1 , A2, A3, A4, and A5. The best matching block is then given as:

A’ = arg min F(Ax), calculated over all Ax blocks on the respective positions x of the search space. It is noted that Figure 7 shows a relatively small search space with 5 candidate positions. However, the present invention is not limited thereto and any shape and size of search space may be tested. The search space may include integer and/or fractional pixel positions. For instance, the search space may be defined as a square or rectangle K x L of samples adjacent in integer distance from the position pointed to by an initial motion vector; K and L being non zero integers. For instance, the estimate of the mean of the template is determined by averaging a subset of the samples included in the template, and estimate of mean of the candidate block samples by averaging a subset of said candidate block samples. However, the subset is not limited to a fraction of samples included in the template or respectively, the candidate block samples. The case of the complete set of samples, i.e. all samples of the template, or, respectively, all candidate block samples, is also possible. For instance, the samples of the subset can consist of every second sample of the block/ the candidate samples in both the vertical and the horizontal direction. Other subsampling implementations are possible as is clear to those skilled in the art. Employing the subset of samples smaller than all samples in the block provides the advantage of reduced number of operations necessary to calculate the mean.

Alternatively, the processing circuitry is configured to determine the estimate of mean of the candidate block samples by a function of samples different from the candidate block samples, but located in the same reference picture as the candidate block samples. These samples may be samples from one or two sample rows adjacent to the boundary of the candidate block on any one or more of the sides of the candidate block (especially the top side and/or the left side, in general the sides on which previously processed blocks in the scanning order are located).

As also already explained above, an advantageous embodiment regards bi-prediction. Accordingly, the processing circuitry is further configured to: find a first best matching motion vector among a first set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, optionally, find a second best matching motion vector among a second set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, and derive the predictor for the current block based on the block samples of the first and, if available, the second best matching block(s) pointed to respectively by the first and, if available, the second best matching motion vectors. For instance, the predictor for the current block is derived without applying to the block samples a local illumination compensation.

The first set of candidate motion vectors corresponds to a first search space in a first reference picture. The second set of candidate motion vectors corresponds to a second search space in a second reference picture different from the first reference picture. The deriving of the predictor for the current block based on the first and the second best-matching blocks without applying to the block samples a local illumination compensation may be performed by performing weighted averaging 1060 on the first best matching block and the second best matching block directly without processing them further, especially by processing (adjusting for) the mean (average illumination) of the best matching blocks. The first best matching motion vector is the best matching motion vector in the first reference picture. The second best motion vector is the best matching motion vector in the second reference picture.

In this embodiment, the step of finding a second best matching motion vector is optional. In one example of this embodiment, a second best matching motion vector is determined by refinement, and in another example, a second best matching motion vector is not determined by refinement. Instead, it may be determined by estimating the refinement based on the refinement obtained for the first best matching motion vector. However, this is not to limit the present disclosure. The second best motion vector may be obtained without refinement in any other way, too. As a second example the second best motion can be simply set equal to the non-refined second motion vector, which was obtained after the process 1030.

According to an embodiment, an apparatus for encoding a current image block is provided comprising the apparatus for determining a predictor of the current image block according to any example or embodiment described above. The apparatus for encoding the current image block further comprises an encoding controller configured to select whether the apparatus for determining the predictor is to apply:

- either the function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples,

- or the same function of the template and the candidate block samples pointed to by a candidate motion vector. Moreover, the apparatus encoding the current image block comprises a bitstream generator configured to generate a bitstream and to include therein as a syntax element an indicator indicating the result of the selection.

In correspondence with the above apparatus for encoding a current image, an apparatus for decoding a current image block from a bitstream is also provided. The apparatus for decoding a current image block comprises the apparatus for determining a predictor of the current image block according to any example or embodiment described above. The apparatus for decoding the current image block further comprises a bitstream parser for parsing from the bitstream a syntax element indicating a selection of whether the apparatus for determining the predictor is to apply: either the function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, or the same function of the template and the candidate block samples pointed to by a candidate motion vector. Further, the apparatus for decoding the current image block comprises a decoder controller configured to control the apparatus for determining the predictor according to the selection.

The apparatuses described above may be a combination of a software and hardware. For example, the encoding and/or decoding may be performed by a chip such as a general purpose processor, or a digital signal processor (DSP), or a field programmable gate array (FPGA), or the like. However, the present invention is not limited to implementation on a programmable hardware. It may be implemented on an application-specific integrated circuit (ASIC) or by a combination of the above mentioned hardware components.

The encoding and/or decoding may also be implemented by program instructions stored on a computer readable medium. The program, when executed, causes the computer to perform the steps of the above described methods. The computer readable medium can be any medium on which the program is stored such as a DVD, CD, USB (flash) drive, hard disc, server storage available via a network, etc.

The above mentioned circuitry may also be a single integrated chip. However, the present invention is not limited thereto and the circuitry may include different pieces or hardware or a combination of hardware and software such as a general purpose processor or DSP programmed with the corresponding code.

Summarizing, the present disclosure relates to motion vector determination based on template matching for bi-directional motion vector estimation. In particular, a block template is constructed as an average of blocks pointed to by initial motion vectors to be refined. Then, the motion vector refinement is performed by template matching in two different reference pictures. The matching is performed by finding an optimum (minimum or maximum, depending on the function) of a matching function corresponding to the best matching block in each of the two reference pictures. The optimum is searched for zero- mean template and a zero-mean candidate block (among the block positions pointed to by the motion vector candidates of the search space). In other words, before performing the function optimization, a mean of the template is subtracted from the template and a mean of the candidate block is subtracted from the candidate block. The predictor of the current block is then calculated as a weighted average of the best matching blocks in the respective reference pictures.

Claims

1. An apparatus (144, 244) for determining a predictor of a current image block, the apparatus comprising a processing circuitry configured to: determine (1040) a template for the current block based on blocks pointed to by initial motion vectors (MV0, MV1 ) pointing to two different respective reference pictures (RefPictO, RefPictl ); find (1050) a best matching motion vector (MV0’, MVT) among candidate motion vectors by optimizing a function of a result of subtracting from the template (T) an estimate of mean of the template and a result of subtracting from candidate block samples (Ax) pointed to by a candidate motion vector an estimate of mean of said candidate block samples (Ax); derive (1060) the predictor for the current block based on block samples of a best matching block (A’, B’) pointed to by the best matching motion vector.

2. The apparatus according to claim 1 , wherein the processing circuitry is configured to derive (1060) the predictor for the current block without applying to the block samples a local illumination compensation.

3. The apparatus according to claim 1 or 2, wherein the processing circuitry is configured to: determine (1040) the template as a weighted average of samples pointed to by two candidate motion vectors in two respective different reference pictures; perform motion vector refinement (1050) of the initial motion vectors using the template; and derive the predictor from the block samples pointed to by the refined best matching motion vector.

4. The apparatus according to any of claims claim 1 to 3, wherein the function is sum of absolute differences.

5. The apparatus according to any of claims 1 to 4, wherein the processing circuitry is configured to determine: the estimate of mean of the template by averaging a subset of the samples included in the template; and/or the estimate of mean of the candidate block samples by averaging a subset of said candidate block samples.

6. The apparatus according to any of claims 1 to 4, wherein the processing circuitry is configured to determine the estimate of mean of the candidate block samples by a function of samples different from the candidate block samples, but located in the same reference picture as the candidate block samples.

7. The apparatus according to any of claims 1 to 6, wherein the processing circuitry is further configured to: find a first best matching motion vector among a first set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, find a second best matching motion vector among a second set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, and derive the predictor for the current block based on the block samples of a first best matching block pointed to by the first best matching motion vector and a second best matching block pointed to by the second best matching motion.

8. An apparatus for encoding a current image block comprising: the apparatus for determining a predictor of the current image block according to any of claims 1 to 7; an encoding controller configured to select whether the apparatus for determining the predictor is to apply:

- or the same function of the template and the candidate block samples pointed to by a candidate motion vector; and a bitstream generator configured to generate a bitstream and to include therein as a syntax element an indicator indicating the result of the selection.

9. An apparatus for decoding a current image block from a bitstream comprising: the apparatus for determining a predictor of the current image block according to any of claims 1 to 7; a bitstream parser for parsing from the bitstream a syntax element indicating a selection of whether the apparatus for determining the predictor is to apply:

- or the same function of the template and the candidate block samples pointed to by a candidate motion vector; and a decoder controller configured to control the apparatus for determining the predictor according to the selection.

10. A method for determining a predictor of a current image block, the method comprising the steps of: determining a template for the current block based on blocks pointed to by initial motion vectors pointing to two different respective reference pictures; finding a best matching motion vector among candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples; deriving the predictor for the current block based on block samples of a best-matching block pointed to by the best matching motion vector.

1 1. The method according to claim 10, wherein the predictor for the current block is derived without applying to the block samples a local illumination compensation.

12. The method according to claim 10 or 1 1 , further comprising: determining the template as a weighted average of samples pointed to by two candidate motion vectors in two respective different reference pictures; refining the initial motion vectors using the template; deriving the predictor from the block samples pointed to by the refined best matching motion vectors.

13. The method according to any of claims 10 to 12, wherein the function is sum of absolute differences.

14. The method according to any of claims 10 to 13, comprising the step of determining: the estimate of mean of the template by averaging a subset of the samples included in the template; and/or the estimate of mean of the candidate block samples by averaging a subset of said candidate block samples.

15. The method according to any of claims 10 to 13, comprising determining the estimate of mean of the candidate block samples by a function of samples different from the candidate block samples, but located in the same reference picture as the candidate block samples.

16. The method according to any of claims 10 to 15, comprising the steps of: finding a first best matching motion vector among a first set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, finding a second best matching motion vector among a second set of candidate motion vectors by optimizing a function of a result of subtracting from the template an estimate of mean of the template and a result of subtracting from candidate block samples pointed to by a candidate motion vector an estimate of mean of said candidate block samples, and deriving the predictor for the current block based on the block samples of a first best matching block pointed to by the first best matching motion vector and a second best matching block pointed to by the second best matching motion vector without applying to the block samples a local illumination compensation.

17. A method for encoding a current image block comprising: determining a predictor of the current image block according to any of claims 10 to 1 6; selecting whether the apparatus for determining the predictor is to apply:

- or the same function of the template and the candidate block samples pointed to by a candidate motion vector; and generating a bitstream and to include therein as a syntax element an indicator indicating the result of the selection.

18. A method for decoding a current image block from a bitstream comprising: determining a predictor of the current image block according to any of claims 10 to 1 6; parsing from the bitstream a syntax element indicating a selection of whether the apparatus for determining the predictor is to apply:

- or the same function of the template and the candidate block samples pointed to by a candidate motion vector; and controlling the apparatus for determining the predictor according to the selection.

19. A computer-readable medium is provided storing instructions which when executed on a processor perform the steps of a method according to any of claims 10 to 18.