WO2020007748A1

WO2020007748A1 - Bilateral filter with lut avoiding unnecessary multiplication and minimizing the lut

Info

Publication number: WO2020007748A1
Application number: PCT/EP2019/067429
Authority: WO
Inventors: Jacob STRÖM; Per Wennersten; Jack ENHORN; Du LIU
Original assignee: Telefonaktiebolaget Lm Ericsson (Publ)
Priority date: 2018-07-02
Filing date: 2019-06-28
Publication date: 2020-01-09

Abstract

A method for applying bilateral filtering to a media object comprising a plurality of samples is provided. The method includes, for a current sample C, computing a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C. The filtered sample value I_F is given by (Formula I) where: I_C is the current sample intensity before filtering; ΔI_A, ΔI_B, ΔI_L, and ΔI_R are the differences, respectively, between the current sample intensity I_C and the neighboring samples above, below, to the left, and to the right of the current sample; σ_d and σ_r are strength parameters; and d(σ_d) is given by d(σ_d) = (Formula II), and wherein computing I_F comprises using a lookup table with two or fewer dimensions.

Description

BILATERAL FILTER WITH LUT AVOIDING UNNECESSARY MULTIPLICATION AND MINIMIZING THE LUT

TECHNICAL FIELD

[001] Disclosed are embodiments related to video compression and filtering.

BACKGROUND

[002] Bilateral filtering of image data directly after forming the reconstructed image block can be beneficial for video compression. As described by Wennersten et al. in “Bilateral Filtering for Video Coding” (referred to as [1] hereafter, and incorporated herein in its entirety), it is possible to reach a bit rate reduction of 0.5% with maintained visual quality for a complexity increase of 3% (encode) and 0% (decode) for random access. However, bilateral filtering involves a division, which can be expensive for hardware implementations. Therefore, Wennersten et al. implemented this using a multiplication and a look-up table of 576 bytes [1]. Later, a division-table-free variant of the bilateral filter from [1] was proposed in “Description of SDR, HDR and 360° video coding technology proposal by Qualcomm and Technicolor - low and high complexity versions” (JVET-J0021, ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11) (referred to as [2] hereafter).

[003] The filter weights in a bilateral filter depend on the image data, so they need to be calculated on-the-fly or obtained from a look-up table (“LUT”). In the implementation in [1], 2,202 bytes were needed for this LUT. Another 576 bytes were needed for the division table, yielding 2,778 bytes in total for the solution in [1]. The implementation proposed in [2] used a LUT of about 33,000 values.

SUMMARY

[004] Even with the division-free implementation, the bilateral filter is costly to implement for some kinds of implementation, such as a fully custom ASIC implementation. In order to gain parallelism in such applications, a filter must typically be instantiated several times. This means that even a small look-up table of 2,202 bytes may be costly in terms of silicon area if it is instantiated, say, seven times. The same goes for multipliers within the filter. It is therefore of interest to further reduce the complexity of the filter in terms of LUT size and in terms of expensive operations such as multiplications.

[005] Several aspects of embodiments herein disclosed are now described at a high level. They are further described in detail below. Embodiments may include one or more of these aspects, including all of the aspects together or any other combination.

[006] LUT dimensionality reduction aspect

[007] As formulated in [2], the filter contribution from each surrounding pixel is calculated as a product of three numbers: distance * range * ΔI. In [2], the first of these multiplications is avoided by fetching a pre-multiplied value of distance * range from a three-dimensional look-up table (LUT). This three-dimensional LUT becomes very big, around 33,000 values. Therefore a first aspect is to avoid this pre-multiplication, so that a two-dimensional LUT can be used instead. The two-dimensional LUT is six times smaller. This solution reintroduces a multiplication; in some embodiments, this multiplication can be moved so that it is done only once per filtered pixel, yielding significant savings.

[008] Multiplication removal aspect

[009] A second aspect is to avoid the second multiplication, between range and ΔI. This is done by pre-calculating this product in the LUT instead of performing the multiplication in the filtering operation. Fortunately, this removal can be made without increasing the dimensionality of the LUT. This means that we can save four multiplications per filtered pixel. When taking into account the multiplication introduced in the first aspect, we therefore only need one multiplication per filtered pixel in total, whereas the solution in [2] may need up to four. This is a substantial reduction.

[0010] LUT row reuse

[0011] The two-dimensional LUT we end up with depends on two parameters, the quantization parameter (qp) and the delta intensity ΔI. Thus one can write the full 2D LUT as a matrix where the different rows represent different qps and the different columns different ΔIs. However, as will be seen in the description below, two given rows in this matrix are quite similar to each other. Therefore, a third aspect is to approximate one row of the matrix using another row and a scaling transform. By doing so, the number of rows actually stored can be dramatically reduced, thereby lowering the size of the LUT by as much as a factor of 20. It should be noted that this way of reusing tables can be applied to the solution in [2] as is, or to the solution in [2] as modified by one or both of the two previously mentioned aspects, and it can also be applied to the solution in [1]. An alternative use of this third aspect is to make each row smaller. As an example, if the longest row is 235 bytes, the scaling transformation can be used to get this down to fewer than 16 bytes. This makes it possible to implement the LUT using SIMD (single instruction, multiple data) instructions, which typically cannot handle LUTs larger than 16 elements.
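One way to see why two rows can be related by a scaling transform is that the range factor depends on ΔI only through the ratio ΔI/σ_r. The following is a minimal sketch of that observation for the unquantized weight, not the transform used by the embodiments. The function names, the constant c = σ_r0/σ_r1 and the value σ_d = 0.82 are assumptions chosen for illustration, and integer rounding of the table entries is ignored.

```python
import math

SIGMA_D = 0.82  # assumed example value of the spatial strength parameter


def weight(sigma_r, delta_i):
    """Unquantized weight exp(-1/(2*sigma_d^2) - delta_i^2/(2*sigma_r^2))."""
    return math.exp(-1.0 / (2.0 * SIGMA_D ** 2)
                    - (delta_i ** 2) / (2.0 * sigma_r ** 2))


def weight_from_reference_row(sigma_r0, sigma_r1, delta_i):
    """Evaluate the sigma_r1 row by rescaling the index into the sigma_r0 row.

    The scaling constant c is hypothetical; it only illustrates that a row
    for one sigma_r can be read from a row tabulated for another sigma_r.
    """
    c = sigma_r0 / sigma_r1
    return weight(sigma_r0, c * delta_i)


if __name__ == "__main__":
    for delta in (0, 1, 7, 50, 200):
        direct = weight(68.0, delta)
        reused = weight_from_reference_row(34.0, 68.0, delta)
        print(delta, direct, reused)
```

In an integer LUT the rescaled index c * ΔI is generally not an integer, so a practical scheme would need to round or interpolate, which is where the approximation enters.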

[0012] Several additional advantages of one or more of these aspects are now described.

[0013] The bilateral filter is typically placed inside the intra-prediction loop. This means that when creating the intra prediction for the current block, filtered pixels from the previous block may be used. In the decoder, this means that the previous block may have to be fully reconstructed and filtered using the bilateral filter before we can start to construct the prediction for the current block.

[0014] This intra prediction can be in the critical path of the decoder. Inserting a filter into this path increases the latency for this critical path. This in turn means that the clock frequency of the chip may need to be lowered, perhaps to a point where all pixels of a frame cannot be decoded in time. Thus it is of utmost importance to be able to do this filtering as quickly as possible.

[0015] This typically means that we need more than one instantiation of the table. As an example, to filter a single pixel as quickly as possible, we would typically need four instantiations of the LUT (the different instantiations of the LUT are described in a section below). Thus even if the LUT size is only 2,202 bytes, we would need 4*2,202 = 8,808 bytes to filter a single pixel quickly. There are alternative places for the filter, where it is not placed inside the intra-prediction loop. As an example, it can be placed as a loop-filter, for instance right before deblocking, after deblocking, or in parallel with sample adaptive offset filtering or in parallel with the adaptive loop filter stage. However, even in this case it will typically not be sufficient with a single instantiation of the LUT to process all of the pixels in a larger image, such as 4K or 8K resolution. Hence, also in this case several instantiations of the LUT would be necessary.

[0016] However, filtering a single pixel quickly may not be sufficient to lower latency. The next block to the right may read any of the right-most pixels in the current block for its prediction. Therefore, all four pixels in the right-most column may need to be filtered as soon as possible. One way of doing this is to parallelize the filtering. Thankfully the bilateral filter is fully parallelizable. As an example, in a 4x4 block, all four pixels in the right-most column can be filtered in parallel without changing the result of the filtering. However, this comes at the cost of more LUT instantiations. If four LUT instantiations are needed for every filtered pixel, and four pixels need to be filtered simultaneously, then up to 4*4 = 16 instantiations of the LUT may be needed. This becomes 16*2,202 = 35,232 bytes. As is described in more detail below, it is possible to filter four pixels in a column using just 7 instantiations, but 7*2,202 = 15,414 bytes is still quite big and will cost silicon surface area. Also, as described below, seven multipliers would be needed to filter the four pixels when using the method described in [2], which may become troublesome in terms of size. It should again be noted that there are alternative places for the filter, other than inside the intra-prediction loop. As an example, it can be placed as a loop-filter, for instance right before deblocking, after deblocking, or in parallel with sample adaptive offset filtering or in parallel with the adaptive loop filter stage. However, even in this case it will typically not be sufficient with a single instantiation of the filter to process all of the pixels in a larger image, such as 4K or 8K resolution. Hence, also in this case several instantiations of the filter would be necessary. A typical implementation might even in this case need 16 instantiations of the LUT.

[0017] Without the first aspect, implementing [2] with 4-pixel parallelism would require 7*33,000 = 231,000 values to be stored for the LUTs. Using the first aspect in combination with [2], only about 7*33,000/6 = 38,500 values would be needed.

[0018] Without the second aspect, implementing [2] would require seven multipliers to filter four connected pixels in parallel. Using the second aspect an implementation would instead require only four multipliers. This is a substantial reduction. Furthermore, these multipliers may be smaller, i.e., using fewer bits in and out, which also translates to lower silicon area usage.

[0019] With the third aspect, the size of the LUT can be further reduced. Applied to the implementation in [1], the size of an individual LUT may go down from 2,202 bytes to 200 bytes. If seven of these instantiations are needed, the combined LUT space may go down from 7*2,202 = 15,414 to 7*200 = 1,400 bytes. That is a substantial reduction.
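The storage totals quoted in the preceding paragraphs follow from multiplying the size of one LUT by the number of instantiations. The short sketch below simply redoes that arithmetic; the constant and function names are illustrative and not taken from [1] or [2].

```python
LUT_BYTES_REF1 = 2202        # one LUT instance in [1], in bytes
LUT_VALUES_REF2 = 33000      # one LUT instance in [2], in values (approximate)
LUT_BYTES_ROW_REUSE = 200    # one LUT instance of [1] after row reuse, in bytes


def total_storage(instances, size_per_instance):
    """Total storage when the same LUT is instantiated several times."""
    return instances * size_per_instance


if __name__ == "__main__":
    print("1 pixel, 4 LUTs:        ", total_storage(4, LUT_BYTES_REF1), "bytes")
    print("4 pixels, 16 LUTs:      ", total_storage(16, LUT_BYTES_REF1), "bytes")
    print("4 pixels, 7 LUTs:       ", total_storage(7, LUT_BYTES_REF1), "bytes")
    print("[2], 7 LUTs:            ", total_storage(7, LUT_VALUES_REF2), "values")
    print("[2] + first aspect:     ", total_storage(7, LUT_VALUES_REF2) // 6, "values")
    print("[1] + third aspect, 7x: ", total_storage(7, LUT_BYTES_ROW_REUSE), "bytes")
```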

[0020] All these reductions will lower the silicon area needed to implement bilateral filtering, saving cost. Furthermore, it is noted that CPU implementations will also benefit by not needing to do a multiplication.

[0021] As noted above, it is possible to put the bilateral filter outside the intra prediction loop, but inside the inter-prediction loop. (While this can lower the performance of the filter, it can still be beneficial to place it there due to latency requirements.) That means that when a block predicts from a previous block in the same image, it will use unfiltered data. This puts the filter outside the critical path for intra coding and makes the complexity of the bilateral filter less critical, since only one instance of the filter may be needed, instead of seven or four. However, it is still of great benefit to have a low-complexity filter, since this translates to less silicon area. When predicting from a previously decoded block from a different image, filtered data will be used, putting the filter inside the inter prediction loop.

[0022] According to an embodiment, a method for applying bilateral filtering to a media object comprising a plurality of samples is provided. The method includes, for a current sample C, computing a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C. The filtered sample value I_F is given by the equation

$$I_F = I_C + d(\sigma_d) \cdot \left( e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \right)$$

where:

I_C is the current sample intensity before filtering;

ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A - I_C;

ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B - I_C;

ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L - I_C;

ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R - I_C;

σ_d is a spatial strength parameter;

σ_r is an intensity strength parameter; and

d(σ_d) is given by

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}},$$

and wherein computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.

[0023] In some embodiments, the lookup table is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In some embodiments, the lookup table is used to determine $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In some embodiments, the lookup table is used to compute an influence function $\mu_{\sigma_r}(\Delta I)$, where $\mu_{\sigma_r}(\Delta I) = e^{-\frac{\Delta I^2}{2\sigma_r^2}} \Delta I$, such that the filtered sample value I_F is given by the equation

$$I_F = I_C + d(\sigma_d) * \left( \mu_{\sigma_r}(\Delta I_A) + \mu_{\sigma_r}(\Delta I_B) + \mu_{\sigma_r}(\Delta I_L) + \mu_{\sigma_r}(\Delta I_R) \right),$$

wherein computing the filtered sample value I_F involves one and only one multiplication per current sample C.

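[0024] In some embodiments, the lookup table is one-dimensional and depends on ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In some embodiments, the lookup table is created based on a fixed value of σ_r (σ_r0), and when computing the filtered sample value I_F for a different value of σ_r (σ_r1), a scaling transform s(σ_r1, ΔI) is applied using a constant c determined from σ_r0 and σ_r1. In some embodiments, the filtered sample value I_F is approximated using fixed point numbers. In some embodiments, I_F is approximated by $\tilde{I}_F$, which is given by

$$\tilde{I}_F = I_C + \left( \left( \mathrm{round}\left(2^6 d(\sigma_d)\right) * \left( \mathrm{round}\left(2^2 \mu_{\sigma_r}(\Delta I_A)\right) + \mathrm{round}\left(2^2 \mu_{\sigma_r}(\Delta I_B)\right) + \mathrm{round}\left(2^2 \mu_{\sigma_r}(\Delta I_L)\right) + \mathrm{round}\left(2^2 \mu_{\sigma_r}(\Delta I_R)\right) \right) + 2^7 \right) \gg 8 \right),$$

where » denotes arithmetic right shift and round(·) rounds to the nearest integer.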

[0025] In some embodiments, using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations. In some embodiments, a size of a row of the lookup table is no more than 128 bits.

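[0026] According to another embodiment, an encoder for applying bilateral filtering to a media object comprising a plurality of samples is provided. The encoder includes a computing unit configured to, for a current sample C, compute a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C, wherein the filtered sample value I_F is given by the equation

$$I_F = I_C + d(\sigma_d) \cdot \left( e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \right)$$

where:

I_C is the current sample intensity before filtering;

ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A - I_C;

ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B - I_C;

ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L - I_C;

ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R - I_C;

σ_d is a spatial strength parameter;

σ_r is an intensity strength parameter; and

d(σ_d) is given by

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}},$$

and wherein computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.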

[0027] According to another embodiment, a decoder for applying bilateral filtering to a media object comprising a plurality of samples is provided. The decoder includes a computing unit configured to, for a current sample C, compute a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C, wherein the filtered sample value I_F is given by the equation

$$I_F = I_C + d(\sigma_d) \cdot \left( e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \right)$$

where:

I_C is the current sample intensity before filtering;

ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A - I_C;

ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B - I_C;

ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L - I_C;

ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R - I_C;

σ_d is a spatial strength parameter;

σ_r is an intensity strength parameter; and

d(σ_d) is given by

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}},$$

and wherein computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.

[0028] According to another embodiment, a computer program is provided, comprising instructions which, when executed by processing circuitry of a node, cause the node to perform the method of any one of the embodiments disclosed herein.

[0029] According to another embodiment, a carrier containing the computer program of embodiments is provided, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.

BRIEF DESCRIPTION OF THE DRAWINGS

[0030] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.

[0031] FIG. 1 illustrates a plot for various quantization parameter (qp) values according to one embodiment.

[0032] FIG. 2 illustrates a plot according to one embodiment.

[0033] FIG. 3 is a flow chart illustrating a process according to one embodiment.

[0034] FIG. 4 is a flow chart illustrating a process according to one embodiment.

[0035] FIG. 5 is a diagram showing functional units of a node according to one embodiment.

[0036] FIG. 6 is a block diagram of a node according to one embodiment.

DETAILED DESCRIPTION

[0037] Throughout this description we will use filtering of intensity values as an example. This traditionally refers to the Y in YCbCr. However, it should be noted that this filtering can also be used for chroma values such as Cb and Cr, or any other components from other color spaces such as ICtCp, Lab, Y’u’v’, etc.

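[0038] The original filter as described in [1] filters a pixel using

$$I_F = I_C + \frac{w_A * \Delta I_A + w_B * \Delta I_B + w_L * \Delta I_L + w_R * \Delta I_R}{1 + w_A + w_B + w_L + w_R} \qquad \text{(Eqn 1)}$$

where I_F is the filtered pixel intensity and I_C is the center pixel intensity, i.e., the pixel intensity before filtering. The value ΔI_A is the difference between the center pixel intensity I_C and the intensity of the pixel above, I_A, i.e., ΔI_A = I_A - I_C. Analogously, I_B, I_L and I_R are the intensities of the pixels immediately below, left and right of the center pixel respectively, and ΔI_B = I_B - I_C, ΔI_L = I_L - I_C, and ΔI_R = I_R - I_C.

[0039] In [1], the weights are calculated as

$$w_X = e^{-\frac{1}{2\sigma_d^2} - \frac{\Delta I_X^2}{2\sigma_r^2}},$$

where X can be A, B, L or R. This is equal to

$$w_X = e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}}.$$

Thus Equation 1 above can be written as

$$I_F = I_C + \frac{e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R}{1 + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_R^2}{2\sigma_r^2}}} \qquad \text{(Eqn 2)}$$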

[0040] During the filtering of a block in [1], the variables σ_d and σ_r are kept constant. However, the intensity differences ΔI_A, ΔI_B, ΔI_L, and ΔI_R change with every pixel, since they depend on the intensity value of the center pixel I_C and the intensity values of the surrounding pixels (I_A, I_B, I_L, and I_R). This means that the denominator in Equation 1 will be different for every pixel. To filter the pixel, a division is needed, and that can be implemented using a division table as described in [1].

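[0041] In order to avoid this division, the filter is reformulated in [2] by simply removing the factors containing $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$ from the denominator (but not from the numerator) of Equation 2. This results in

$$I_F = I_C + \frac{e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} \qquad \text{(Eqn 3)}$$

[0042] Since $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$ is at most 1 (and strictly smaller than 1 whenever ΔI is nonzero), each affected term in the denominator becomes a bit larger, and the larger denominator in Equation 3 will thus give a weaker filtering (less deviation from the original pixel I_C) than Equation 2 would give for the same parameters σ_d and σ_r. By letting

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}} \qquad \text{(Eqn 4)}$$

it is possible to rewrite Equation 3 as

$$I_F = I_C + d(\sigma_d) e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + d(\sigma_d) e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + d(\sigma_d) e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + d(\sigma_d) e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \qquad \text{(Eqn 5)}$$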
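The claim that the division-free form filters more weakly can be checked numerically. The sketch below evaluates the filter with the full denominator of Equation 2 and with the simplified denominator of Equation 3 in floating point; the function names and the sample parameter values are illustrative assumptions only.

```python
import math


def spatial_term(sigma_d):
    """exp(-1/(2*sigma_d^2)): spatial factor for a neighbor at distance 1."""
    return math.exp(-1.0 / (2.0 * sigma_d ** 2))


def range_term(sigma_r, delta_i):
    """exp(-delta_i^2/(2*sigma_r^2)): range factor for one neighbor."""
    return math.exp(-(delta_i ** 2) / (2.0 * sigma_r ** 2))


def filter_eqn2(i_c, deltas, sigma_d, sigma_r):
    """Filter with the data-dependent denominator (needs a division per pixel)."""
    w = [spatial_term(sigma_d) * range_term(sigma_r, d) for d in deltas]
    numerator = sum(wi * di for wi, di in zip(w, deltas))
    return i_c + numerator / (1.0 + sum(w))


def filter_eqn3(i_c, deltas, sigma_d, sigma_r):
    """Same numerator, but the denominator no longer depends on the data."""
    w = [spatial_term(sigma_d) * range_term(sigma_r, d) for d in deltas]
    numerator = sum(wi * di for wi, di in zip(w, deltas))
    return i_c + numerator / (1.0 + 4.0 * spatial_term(sigma_d))


if __name__ == "__main__":
    i_c, deltas = 100, (5, 3, 8, 2)  # differences to neighbors A, B, L, R
    print(filter_eqn2(i_c, deltas, 0.82, 10.0))
    print(filter_eqn3(i_c, deltas, 0.82, 10.0))
```

Because the numerators are identical and the denominator of Equation 3 is never smaller than that of Equation 2, the second result always lies at least as close to the unfiltered value as the first.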

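[0043] Note that the variable d(σ_d) is denoted Distance_k in [2]. Also, $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$ is denoted Range_k in [2]. By further defining

$$W_{\sigma_d,\sigma_r}(\Delta I) = d(\sigma_d) e^{-\frac{\Delta I^2}{2\sigma_r^2}},$$

Equation 5 is further simplified in [2] to

$$I_F = I_C + W_{\sigma_d,\sigma_r}(\Delta I_A) * \Delta I_A + W_{\sigma_d,\sigma_r}(\Delta I_B) * \Delta I_B + W_{\sigma_d,\sigma_r}(\Delta I_L) * \Delta I_L + W_{\sigma_d,\sigma_r}(\Delta I_R) * \Delta I_R \qquad \text{(Eqn 6)}$$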

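[0044] Please note that upper case W is different from lower case w here, as

$$w_X = e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}} \qquad \text{(Eqn 7)}$$

but

$$W_{\sigma_d,\sigma_r}(\Delta I_X) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}} \qquad \text{(Eqn 8)}$$

[0045] In [2], Equation 6 is written using a summation symbol, which in our notation becomes

$$I_F = I_C + \sum_{X \in \{A, B, L, R\}} W_{\sigma_d,\sigma_r}(\Delta I_X) * \Delta I_X \qquad \text{(Eqn 9)}$$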

[0046] There are two limitations associated with implementing the filter according to Equation 6 or 9. First, it is noted that the (upper case) weight $W_{\sigma_d,\sigma_r}(\Delta I_X)$ depends on three variables: σ_d, σ_r, and ΔI_X. In the implementation in [2], σ_d can take on six different values, σ_r can take on 34 different values and ΔI_X up to 1,023 different values. Thus if the weight $W_{\sigma_d,\sigma_r}(\Delta I_X)$ is stored as a look-up table (LUT), up to 6*34*1,023 = 208,692 values need to be stored. While this may not be much of a problem for a CPU-based implementation, it can be a serious problem for an implementation in full custom ASIC, where such a large LUT will translate to a sizable part of the available silicon area.

[0047] Typically, many of these values are small enough to be rounded to zero for a fixed precision, and they need not be tabulated. However, even when doing so, up to 33,000 values may be needed for the look-up table. This is in contrast to the implementation in [1], where only around 2,800 bytes are needed for the LUT and the division table. On the face of it, therefore, the approximation used in [2] that avoids the division does not bring much in terms of savings. Instead, it looks like an increase in complexity, especially when it comes to the size of the LUT.

[0048] The second limitation with the implementation according to Equation 9 is that it mandates a multiplication inside the summation; the (upper case) weight $W_{\sigma_d,\sigma_r}(\Delta I_X)$ is multiplied with the intensity difference ΔI_X. Again, this may not be much of an issue for a CPU-based implementation, but for a full custom ASIC type implementation this can be expensive, especially if the filter needs to be instantiated several times to achieve parallelism.

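[0049] Therefore, in one embodiment, we instead use a different implementation which is now described in detail. We go back to Equation 5, which is hereby repeated for the convenience of the reader,

$$I_F = I_C + d(\sigma_d) e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + d(\sigma_d) e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + d(\sigma_d) e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + d(\sigma_d) e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \qquad \text{(Eqn 5)}$$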

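[0050] An important first aspect of some embodiments is that the factor d(σ_d) is now extracted from the four terms,

$$I_F = I_C + d(\sigma_d) * \left[ e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \right] \qquad \text{(Eqn 10)}$$

[0051] An important second aspect of some embodiments is that we now define an influence function $\mu_{\sigma_r}(\Delta I)$ that depends on σ_r and ΔI as

$$\mu_{\sigma_r}(\Delta I) = e^{-\frac{\Delta I^2}{2\sigma_r^2}} \Delta I \qquad \text{(Eqn 11)}$$

[0052] By using Equation 11 we can now rewrite the bracketed expression in Equation 10 as a sum of these influence terms. Equation 10 then simplifies to

$$I_F = I_C + d(\sigma_d) * \left( \mu_{\sigma_r}(\Delta I_A) + \mu_{\sigma_r}(\Delta I_B) + \mu_{\sigma_r}(\Delta I_L) + \mu_{\sigma_r}(\Delta I_R) \right) \qquad \text{(Eqn 12)}$$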

[0053] We can now see that we have accomplished two things. The first is that we no longer need the upper case weight $W_{\sigma_d,\sigma_r}(\Delta I)$ that depends on three variables, and hence needs to be tabulated as a three-dimensional LUT. Instead we have separated out d(σ_d) from the other terms, removing the need for the LUT to depend on the σ_d variable. Hence the LUT can depend on two variables, σ_r and ΔI. The second is that we do not store $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$ in the LUT, but instead tabulate the influence function $\mu_{\sigma_r}(\Delta I) = e^{-\frac{\Delta I^2}{2\sigma_r^2}} \Delta I$, where the multiplication with ΔI has already taken place.
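As an illustration of the two aspects together, the following floating-point sketch tabulates the influence function for one σ_r and filters a sample with a single multiplication at filter time, then compares the result with a direct evaluation of the four-term form. All names are hypothetical. The sketch also uses the fact that the influence function is odd, so only non-negative differences are tabulated. In a real implementation d(σ_d) would itself be read from a small table rather than computed.

```python
import math


def d_factor(sigma_d):
    """d(sigma_d) = e / (1 + 4e) with e = exp(-1/(2*sigma_d^2))."""
    e = math.exp(-1.0 / (2.0 * sigma_d ** 2))
    return e / (1.0 + 4.0 * e)


def build_influence_row(sigma_r, max_delta=1023):
    """Tabulate mu(dI) = exp(-dI^2/(2*sigma_r^2)) * dI for dI = 0..max_delta."""
    return [math.exp(-(d * d) / (2.0 * sigma_r ** 2)) * d
            for d in range(max_delta + 1)]


def mu(row, delta_i):
    """Look up mu; it is an odd function, so negative inputs flip the sign."""
    value = row[abs(delta_i)]
    return value if delta_i >= 0 else -value


def filter_with_lut(i_c, neighbors, sigma_d, row):
    """Sum of four table look-ups followed by one multiplication."""
    influence_sum = sum(mu(row, n - i_c) for n in neighbors)
    return i_c + d_factor(sigma_d) * influence_sum


def filter_direct(i_c, neighbors, sigma_d, sigma_r):
    """Direct evaluation with d(sigma_d) applied to each of the four terms."""
    d = d_factor(sigma_d)
    return i_c + sum(
        d * math.exp(-((n - i_c) ** 2) / (2.0 * sigma_r ** 2)) * (n - i_c)
        for n in neighbors)


if __name__ == "__main__":
    row = build_influence_row(sigma_r=10.0)
    neighbors = (105, 97, 108, 100)  # intensities of A, B, L, R
    print(filter_with_lut(100, neighbors, 0.82, row))
    print(filter_direct(100, neighbors, 0.82, 10.0))
```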

[0054] As for the first aspect, when filtering according to Equation 9 as in [2], up to 6*34*1,023 = 208,692 values need to be stored. But when filtering according to the second aspect of embodiments using Equation 12, a maximum of 6 + 34*1,023 = 34,788 values need to be stored.

[0055] Above we saw that the implementations can save LUT space by storing only non-zero values. This way the implementation in [2] can also reach about 33,000 values. However, the same technique can be used here. A relevant accuracy for the influence value may be to use two fractional bits. In this case, the value will on average be zero for values of ΔI larger than 132. Thus about 133*34 = 4,522 values need to be stored. This is smaller than the 33,000 values needed using the implementation in [2] by a factor of 33,000/4,522 ≈ 7.
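The table sizes compared above can be reproduced with a few lines of arithmetic. The sketch below is only a restatement of the counts in the text; the constant names are illustrative.

```python
N_SIGMA_D = 6        # number of sigma_d values in [2]
N_SIGMA_R = 34       # number of sigma_r values in [2]
N_DELTA = 1023       # number of delta-intensity values considered
PRUNED_PER_ROW = 133 # average entries kept per row with two fractional bits
PRUNED_REF2 = 33000  # approximate pruned table size reported for [2]

size_3d = N_SIGMA_D * N_SIGMA_R * N_DELTA   # full 3D table of W
size_2d = N_SIGMA_D + N_SIGMA_R * N_DELTA   # d values plus full 2D table of mu
size_2d_pruned = PRUNED_PER_ROW * N_SIGMA_R # 2D table without zero entries

if __name__ == "__main__":
    print("3D table:        ", size_3d)
    print("2D table + d:    ", size_2d)
    print("2D table, pruned:", size_2d_pruned)
    print("ratio vs [2]:    ", round(PRUNED_REF2 / size_2d_pruned, 1))
```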

[0056] The second thing that we have accomplished is that we have removed the need for several multiplications per filtered pixel, as seen in Equation 12. Due to this second aspect we can avoid the four multiplications marked by * in Equation 6 used in [2]. Comparing instead with [1], we can avoid the four multiplications marked by * in Equation 1. Compared to both cases, this second aspect saves multiplications. Instead one multiplication per filtered pixel is introduced (the multiplication marked by * in Equation 12). Thus, instead of four multiplications per pixel, we are down to only one multiplication per pixel. This is a substantial reduction.

[0057] As is explained below, some computation can be shared between pixels. Thus on average, instead of four multiplications per filtered pixel, implementations of [1] and [2] would need about two multiplications per filtered pixel, if the techniques described below about sharing computations are used. However, using the second aspect of embodiments, only one multiplication per filtered pixel is needed. This is still a substantial reduction over the modified versions of [1] and [2].

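[0058] It should be noted that this is not merely a question of implementation. In order to avoid decoder drift, it is essential that the encoder and the decoder get exactly the same result during filtering. Therefore it is not sufficient to state an approximate value of the weights when defining the video coding standard. These values must be defined exactly down to the last bit. Also operations such as rounding must be exactly defined. Therefore, a video coding standard using bilateral filtering inside any prediction loop (be it intra or inter prediction) must not only define the values of the weights used, but also the precision and where and how the rounding happens. As an example, in [2] the value of the upper case weight $W_{\sigma_d,\sigma_r}(\Delta I)$ will be represented as a fixed point number. As an example, if 8 fractional bits are used, the actual value stored in the look-up table will be the integer

$$\hat{W}_{\sigma_d,\sigma_r}(\Delta I) = \mathrm{round}\left(2^8 \cdot W_{\sigma_d,\sigma_r}(\Delta I)\right),$$

where round(·) rounds to the nearest integer. The filtered pixel will then be calculated as

$$\hat{I}_F = I_C + \left( \left( \hat{W}_{\sigma_d,\sigma_r}(\Delta I_A) * \Delta I_A + \hat{W}_{\sigma_d,\sigma_r}(\Delta I_B) * \Delta I_B + \hat{W}_{\sigma_d,\sigma_r}(\Delta I_L) * \Delta I_L + \hat{W}_{\sigma_d,\sigma_r}(\Delta I_R) * \Delta I_R + 2^7 \right) \gg 8 \right),$$

where » denotes arithmetic right shift, and the offset 2⁷ is used to round evenly. However, in an embodiment, the filtered pixel is instead calculated as

$$\tilde{I}_F = I_C + \left( \left( \hat{d}(\sigma_d) * \left( \hat{\mu}_{\sigma_r}(\Delta I_A) + \hat{\mu}_{\sigma_r}(\Delta I_B) + \hat{\mu}_{\sigma_r}(\Delta I_L) + \hat{\mu}_{\sigma_r}(\Delta I_R) \right) + 2^7 \right) \gg 8 \right),$$

with $\hat{d}(\sigma_d) = \mathrm{round}\left(2^6 d(\sigma_d)\right)$ and $\hat{\mu}_{\sigma_r}(\Delta I) = \mathrm{round}\left(2^2 \mu_{\sigma_r}(\Delta I)\right)$.

[0059] Here we have used six fractional bits to represent d(σ_d) and two fractional bits to represent $\mu_{\sigma_r}(\Delta I)$, which is a realistic precision. Note that both $\hat{I}_F$ and $\tilde{I}_F$ are approximations of I_F, but crucially they are different approximations, since the components have been rounded differently. If the encoder uses $\hat{I}_F$ and the decoder uses $\tilde{I}_F$, there will be decoder drift and undefined behavior. Hence the same formula must be used in both cases, and it must be defined in the standard.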
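To make the point about differing approximations concrete, the sketch below implements two fixed-point variants in the spirit of the paragraph above: one that quantizes the combined weight with 8 fractional bits, and one that quantizes d(σ_d) with six and the influence value with two fractional bits. The exact formulas, the function names and the parameter values (σ_d = 0.82, σ_r = 2) are assumptions made for illustration, and the table entries are computed on the fly rather than read from a stored LUT.

```python
import math


def d_factor(sigma_d):
    e = math.exp(-1.0 / (2.0 * sigma_d ** 2))
    return e / (1.0 + 4.0 * e)


def range_term(sigma_r, delta_i):
    return math.exp(-(delta_i ** 2) / (2.0 * sigma_r ** 2))


def filter_weight_quantized(i_c, deltas, sigma_d, sigma_r, frac_bits=8):
    """Quantize W = d * range to frac_bits, multiply by dI per neighbor."""
    acc = 0
    for delta in deltas:
        w_hat = round((1 << frac_bits) * d_factor(sigma_d) * range_term(sigma_r, delta))
        acc += w_hat * delta
    return i_c + ((acc + (1 << (frac_bits - 1))) >> frac_bits)


def filter_influence_quantized(i_c, deltas, sigma_d, sigma_r, d_bits=6, mu_bits=2):
    """Quantize d and mu separately; one integer multiplication per sample."""
    d_hat = round((1 << d_bits) * d_factor(sigma_d))
    mu_sum = sum(round((1 << mu_bits) * range_term(sigma_r, delta) * delta)
                 for delta in deltas)
    shift = d_bits + mu_bits
    # Python's >> on ints is an arithmetic (floor) shift, also for negatives.
    return i_c + ((d_hat * mu_sum + (1 << (shift - 1))) >> shift)


if __name__ == "__main__":
    deltas = (2, 1, 1, 0)  # differences to neighbors A, B, L, R
    print(filter_weight_quantized(100, deltas, 0.82, 2.0))
    print(filter_influence_quantized(100, deltas, 0.82, 2.0))
```

For this particular input the two variants land on different integers, which is the kind of mismatch that would cause drift if encoder and decoder did not agree on one formula.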

[0060] In some embodiments, the pixel being filtered will always have access to all its surrounding pixels at the time of filtering. However, in an embodiment where the filter is inside the intra prediction loop, as is the case for both [1] and [2], this cannot always be the case. As an example, assume we are filtering a block, and the block to the right has not yet been decoded. This means that a pixel situated on the right edge of the block will not have access to its right neighbor, since this neighbor belongs to the not-yet-decoded block. In this case the filtering will have to make do with fewer surrounding pixels.

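[0061] The solution to this situation used in [2] is to simply exclude this term in the calculation. Thus Equation 6 is changed to

$$I_F = I_C + W_{\sigma_d,\sigma_r}(\Delta I_A) * \Delta I_A + W_{\sigma_d,\sigma_r}(\Delta I_B) * \Delta I_B + W_{\sigma_d,\sigma_r}(\Delta I_L) * \Delta I_L \qquad \text{(Eqn 17)}$$

in which the last term of Equation 6 has been removed.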

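[0062] The solution to this situation in [1] is different. Here Equation 1 is still used, but w_R is set to zero, which gives

$$I_F = I_C + \frac{w_A \Delta I_A + w_B \Delta I_B + w_L \Delta I_L}{1 + w_A + w_B + w_L} \qquad \text{(Eqn 18)}$$

[0063] Since the denominator now becomes smaller, the filter in [1] compensates for the lack of information in pixel R by trusting the remaining pixels A, B and L more. This should give a better filtering, and is different from what happens in Equation 17 used in [2], which will simply filter such a pixel less strongly. Continuing with Equation 18, we can follow the same steps of approximation as we did in Equations 2 and 3 and get

$$I_F = I_C + \frac{e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L}{1 + 3e^{-\frac{1}{2\sigma_d^2}}}$$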

[0064] In this case, this gives a different value of

$$d_{edge}(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 3e^{-\frac{1}{2\sigma_d^2}}},$$

where we have a 3 in the denominator instead of a 4. Analogously, for corner pixels, where two surrounding pixels are missing, and we thus only have two neighbors, the corresponding value would be

$$d_{corner}(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 2e^{-\frac{1}{2\sigma_d^2}}}.$$

Thus when filtering, we should ideally use d_edge(σ_d) when filtering edge pixels, d_corner(σ_d) when filtering corner pixels and d(σ_d) otherwise. Doing this also gives a BD rate reduction (reduction in bit rate for the same quality) as compared to always using d(σ_d), as is done in [2]. However, if one would like to implement this using the prior art scheme from [2], this would be very expensive. The value stored in the look-up table in [2] is

$$W_{\sigma_d,\sigma_r}(\Delta I) = d(\sigma_d) e^{-\frac{\Delta I^2}{2\sigma_r^2}},$$

and this would mean that we would need to create two more such LUTs,

$$W^{edge}_{\sigma_d,\sigma_r}(\Delta I) = d_{edge}(\sigma_d) e^{-\frac{\Delta I^2}{2\sigma_r^2}}, \qquad W^{corner}_{\sigma_d,\sigma_r}(\Delta I) = d_{corner}(\sigma_d) e^{-\frac{\Delta I^2}{2\sigma_r^2}}.$$

Thus the total size of all the LUTs would increase by a factor of three. In the prior art, the LUT size would go up from 6*34*1,023 = 208,692 to 3*6*34*1,023 = 626,076 values, or approximately 33,000*3 = 99,000 values if values close to zero are omitted.

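[0065] In sharp contrast, accommodating this more accurate filtering using embodiments herein disclosed results in no extra cost for the LUT. Equation 12 can be used for pixels that are neither border pixels nor corner pixels, whereas for border pixels (e.g., when the right neighbor is unavailable) we can use

$$I_F = I_C + d_{edge}(\sigma_d) * \left( \mu_{\sigma_r}(\Delta I_A) + \mu_{\sigma_r}(\Delta I_B) + \mu_{\sigma_r}(\Delta I_L) \right)$$

and for corner pixels, summing over the two available neighbors X_1 and X_2, we can use

$$I_F = I_C + d_{corner}(\sigma_d) * \left( \mu_{\sigma_r}(\Delta I_{X_1}) + \mu_{\sigma_r}(\Delta I_{X_2}) \right)$$

The LUT for the influence values remains the same, and only the d(σ_d) values must come in three versions, increasing their total number from six to 18. Thus the cost goes from 6 + 34*1,023 = 34,788 values to 18 + 34*1,023 = 34,800 values, a negligible change.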
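The handling of border and corner pixels can be sketched by letting the normalization factor depend on how many neighbors are available while the influence function stays unchanged. The code below is a floating-point illustration with hypothetical names; it is not a bit-exact definition.

```python
import math


def d_factor(sigma_d, num_neighbors=4):
    """e / (1 + n*e): n = 4 for interior, 3 for edge, 2 for corner samples."""
    e = math.exp(-1.0 / (2.0 * sigma_d ** 2))
    return e / (1.0 + num_neighbors * e)


def influence(sigma_r, delta_i):
    """mu(dI) = exp(-dI^2/(2*sigma_r^2)) * dI; identical for all sample types."""
    return math.exp(-(delta_i ** 2) / (2.0 * sigma_r ** 2)) * delta_i


def filter_sample(i_c, neighbors, sigma_d, sigma_r):
    """neighbors holds the intensities of the available neighbors (2, 3 or 4)."""
    total = sum(influence(sigma_r, n - i_c) for n in neighbors)
    return i_c + d_factor(sigma_d, len(neighbors)) * total


if __name__ == "__main__":
    print(filter_sample(100, (105, 97, 108, 100), 0.82, 10.0))  # interior
    print(filter_sample(100, (105, 97, 108), 0.82, 10.0))       # edge
    print(filter_sample(100, (105, 108), 0.82, 10.0))           # corner
    print("stored values:", 18 + 34 * 1023)
```

Only the small table of normalization factors grows (three versions per σ_d value); the influence table is shared by all three cases.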

[0066] In another embodiment, we do not make use of the second aspect, i.e., we do not bake the multiplication into the LUT. Instead of storing the value $\mu_{\sigma_r}(\Delta I)$ in the LUT, we store only the first part $r_{\sigma_r}(\Delta I)$ (without multiplication by ΔI), given by

$$r_{\sigma_r}(\Delta I) = e^{-\frac{\Delta I^2}{2\sigma_r^2}}.$$

The filtered pixel is then calculated as

$$I_F = I_C + d(\sigma_d) * \left( r_{\sigma_r}(\Delta I_A) * \Delta I_A + r_{\sigma_r}(\Delta I_B) * \Delta I_B + r_{\sigma_r}(\Delta I_L) * \Delta I_L + r_{\sigma_r}(\Delta I_R) * \Delta I_R \right) \qquad \text{(Eqn 22)}$$

where we have highlighted the multiplications using *. In this case we will not get the benefit of having fewer multiplications, but we can still benefit from the first aspect, namely a smaller LUT size. This is due to the fact that $r_{\sigma_r}(\Delta I)$ only depends on two variables (σ_r and ΔI) and therefore is a 2D LUT of the same size as the LUT for $\mu_{\sigma_r}(\Delta I)$, and hence considerably smaller than the 3D LUT used for storing $W_{\sigma_d,\sigma_r}(\Delta I)$.

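[0067] For the third aspect we will first consider the original filter as described in [1]. It filters a pixel using Equation 1, repeated here for the convenience of the reader.

$$I_F = I_C + \frac{w_A * \Delta I_A + w_B * \Delta I_B + w_L * \Delta I_L + w_R * \Delta I_R}{1 + w_A + w_B + w_L + w_R} \qquad \text{(Eqn 1)}$$

[0068] In [1], the weights are calculated as

$$w_X = e^{-\frac{1}{2\sigma_d^2} - \frac{\Delta I_X^2}{2\sigma_r^2}},$$

where X can be A, B, L or R. This is equal to

$$w_X = e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}}.$$

Thus Equation 1 above can be written as

$$I_F = I_C + \frac{e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R}{1 + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_R^2}{2\sigma_r^2}}}$$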
[0069] It is noted in [1] that by changing the 1 in the denominator, it is possible to get the same result as if σ_d had instead been changed. It is thus sufficient to tabulate w(σ_d, σ_r, ΔI) for a given σ_d, such as σ_d = 0.82, and replace the 1 in the denominator when another σ_d is needed.
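The observation from [1] can be checked with a short derivation. The steps below are a sketch in the notation used above; the reference value σ_d0 (for instance 0.82) and the symbol k for the replacement constant are introduced here only for illustration.

```latex
\begin{align*}
I_F &= I_C + \frac{\sum_X e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}} \Delta I_X}
                  {1 + \sum_X e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}}}
     && \text{(filter with the desired } \sigma_d\text{)} \\
    &= I_C + \frac{\sum_X e^{-\frac{\Delta I_X^2}{2\sigma_r^2}} \Delta I_X}
                  {e^{\frac{1}{2\sigma_d^2}} + \sum_X e^{-\frac{\Delta I_X^2}{2\sigma_r^2}}}
     && \text{(divide by } e^{-\frac{1}{2\sigma_d^2}}\text{)} \\
    &= I_C + \frac{\sum_X w(\sigma_{d0}, \sigma_r, \Delta I_X)\, \Delta I_X}
                  {k + \sum_X w(\sigma_{d0}, \sigma_r, \Delta I_X)},
     \qquad k = e^{\frac{1}{2\sigma_d^2} - \frac{1}{2\sigma_{d0}^2}}
     && \text{(multiply by } e^{-\frac{1}{2\sigma_{d0}^2}}\text{)}
\end{align*}
```

For σ_d = σ_d0 the constant k equals 1 and the original form is recovered; any other σ_d only changes k, while the tabulated weights stay the same.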

[0070] Since σ_d can be held constant in the LUT, it can now be made a two-dimensional LUT(σ_r, ΔI) = w(0.82, σ_r, ΔI), i.e., it only depends on the variables ΔI and σ_r. The value of σ_r is calculated directly from the qp value according to σ_r = 2 * (qp - 17) for 10-bit data or σ_r = (qp - 17)/2 for 8-bit data. Therefore we can equivalently say that the look-up table is indexed using ΔI and qp: w = LUT(qp, ΔI). Also, since all values in the LUT are smaller than 1, we need fractional resolution when representing the LUT using integers. In [1], 65 represents 1.0, which means that the following formula is used to calculate the LUT

$$LUT(qp, \Delta I) = \mathrm{round}\left(65 \cdot e^{-\frac{1}{2 \cdot 0.82^2} - \frac{\Delta I^2}{2\sigma_r^2}}\right),$$

where round(·) rounds to the nearest integer. For 10-bit values, the difference between two intensities can range from 0 - 1023 = -1023 to 1023 - 0 = 1023. However, since the formula contains a square, it is always true that LUT(qp, -ΔI) = LUT(qp, ΔI), so only non-negative values of ΔI need to be tabulated.

[0071] As an example, for the lowest qp allowed, qp=18, the values of the LUT for ΔI = 0...1023 are LUT(18, ΔI) = 31, 27, 19, 10, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0.

[0072] It is noted in [1] that it is sufficient to store the first zero. This lowers the number of integers that need to be stored significantly. However, there is still a large number of integers to be stored. As an example, for qp = 34 we have LUT(34, ΔI) = 31, 31, 31, 31, 31,

31, 30, 30, 30, 30, 30, 29, 29, 29, 28, 28, 28, 27, 27, 26, 26, 26, 25, 25, 24, 24, 23, 23, 22, 21, 21,

20, 20, 19, 19, 18, 18, 17, 17, 16, 15, 15, 14, 14, 13, 13, 12, 12, 11, 11, 10, 10, 10, 9,9,8, 8,8,7,

7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

0, ...,0.

[0073] Likewise, for qp=51 (the highest qp) we have LUT(51, ΔI) = 31, 31, 31, 31, 31,

31,31,31,31,31,31, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 29, 29, 29, 29, 29, 29, 29, 28, 28, 28, 28, 28, 27, 27, 27, 27, 27, 26, 26, 26, 26, 26, 25, 25, 25, 25, 24, 24, 24, 24, 23, 23, 23, 23, 22, 22,

22, 21, 21, 21, 21, 20, 20, 20, 20, 19, 19, 19, 18, 18, 18, 18, 17, 17, 17, 17, 16, 16, 16, 15, 15, 15,

15, 14, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 10, 9,9,9, 9,9,8,

8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3,

3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1,1, 1,1, 1,1, 1,0, 0, ..., 0.

[0074] Plotting these into the same diagram, with ΔI as the x-value and the weight as the y-value, we get the plot illustrated in FIG. 1.

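[0075] As shown in FIG. 1, it looks like the curve for qp = 51 is just a stretched version of that for qp = 34. This turns out to be exactly the case. The formula for the curve for qp = 34 is, according to Equation 24 (here we have omitted the rounding for convenience)

$$LUT(34, \Delta I) = 65 \cdot e^{-\frac{1}{2 \cdot 0.82^2} - \frac{|\Delta I|^2}{2 \cdot 2^2 \cdot (34 - 17)^2}}$$

But $\frac{|\Delta I|^2}{2 \cdot 2^2 \cdot (34 - 17)^2}$ can, if we multiply both numerator and denominator by $\frac{(51 - 17)^2}{(34 - 17)^2}$, be written as

$$\frac{|\Delta I|^2 \cdot \frac{(51 - 17)^2}{(34 - 17)^2}}{2 \cdot 2^2 \cdot (34 - 17)^2 \cdot \frac{(51 - 17)^2}{(34 - 17)^2}} = \frac{\left|\Delta I \cdot \frac{51 - 17}{34 - 17}\right|^2}{2 \cdot 2^2 \cdot (51 - 17)^2} = \frac{|\Delta I \cdot c|^2}{2 \cdot 2^2 \cdot (51 - 17)^2}, \qquad \text{(Eqn 25)}$$

where $c = \frac{51 - 17}{34 - 17} = 2$. Hence we see that

$$LUT(34, \Delta I) = LUT(51, c \cdot \Delta I) = LUT(51, 2\Delta I).$$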

[0076] Thus, instead of using LUT(34, ΔI), we can get the same result by just taking every second value in LUT(51, ΔI) instead. (It is every second, since c happens to be exactly 2 in this case.)

[0077] This means that it is sufficient to store a one-dimensional LUT table, for instance LUT(51, ΔI). If we are interested in another value for qp, such as qp = 34, instead of storing another 1D row LUT(34, ΔI), we simply reuse the one for qp = 51: LUT(34, ΔI) = LUT(51, 2ΔI). This means that, instead of storing 3,468 values as in [1], we only need to store 197 values (the number of values in LUT(51, ΔI)). This is a reduction by a factor of 17.
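The row reuse can be checked numerically. The following Python sketch is an illustration only: it tabulates the rows for 10-bit data with 65 representing 1.0 and σ_d = 0.82, as described above, and the helper name `lut_row` is an assumption introduced here.

```python
import math

SIGMA_D = 0.82   # fixed spatial strength used for the tabulation

def lut_row(qp, max_delta=1023):
    """Row LUT(qp, dI) for 10-bit data: sigma_r = 2 * (qp - 17), 65 represents 1.0."""
    sigma_r = 2 * (qp - 17)
    row = []
    for d in range(max_delta + 1):
        w = 65.0 * math.exp(-1.0 / (2 * SIGMA_D ** 2)
                            - (d * d) / (2.0 * sigma_r ** 2))
        row.append(int(math.floor(w + 0.5)))   # round to the nearest integer
    return row

row34 = lut_row(34)
row51 = lut_row(51)

# c = (51 - 17) / (34 - 17) = 2, so every second entry of the qp = 51 row
# reproduces the qp = 34 row; beyond the end of the row the weight is zero.
reused34 = [row51[2 * d] if 2 * d < len(row51) else 0 for d in range(len(row34))]
```

Under these assumptions `reused34` and `row34` coincide entry by entry, so only the qp = 51 row would need to be stored for this pair of qp values.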

[0078] However, we must also handle the c-values. It is not always the case that they will be as neat as in the above example, where c = 2. In theory the c-value can be calculated on the fly using

$$c = \frac{51 - 17}{qp - 17}, \qquad \text{(Eqn 26)}$$

but having a division inside the filtering should be avoided. Instead we could store the c-values using, for instance, 8 bits: four bits for the integer part and four bits for the fractional part. This means that we would need to store 34 8-bit values for the c-values, and 197 values for LUT(51, ΔI). It is even possible to save further by using another table as the base table. As an example, if qp=34 is the reference table (or base table), then only 99 values are needed. In total such an implementation would need 34 + 99*5/8 = 96 bytes of data compared to 2,202 bytes, a reduction by a factor of more than 20. This second aspect can be used even without reuse of the rows; as an example, it is possible to store one LUT row for every qp, but use every value twice in order to reduce the size of the LUT row. As an example, the LUT row used for qp 51 could be:

LUT*(51, ΔI) = 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 29, 29, 29, 28, 28, 28, 27, 27, 26, 26, 26, 25, 25, 24, 24, 23, 23, 22, 21, 21, 20, 20, 19, 19, 18, 18, 17, 17, 16, 15, 15, 14, 14, 13, 13, 12, 12, 11, 11, 10, 10, 10, 9, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ..., 0.

[0079] This is equivalent to LUT(34, ΔI) above. However, we would use a c-value of

$$c = \frac{34 - 17}{51 - 17} = \frac{1}{2}$$

so that the LUT row would fit qp=51 rather than qp=34. Here 34 is the first term in the numerator since the table row in its original form was created to work for qp=34, the source qp. Also, 51 is the first term in the denominator since this is the qp that we want to use the LUT row for, the target qp. As can be seen, LUT*(51, ΔI) (marked with a star) has only 98 non-zero values, which is half the number compared to LUT(51, ΔI), which has 196 non-zero values. If this is done not only for qp 51, but for all qps, it is possible to reduce the total number of LUT values by half without even reusing LUTs. In another embodiment the scaling can be used to set a maximum for the number of elements in a LUT row. Many decoders and encoders are implemented in software, and in order to have an efficient implementation it is well known that SIMD operations are often used to speed up execution of the software. There are efficient SIMD operations for table look-up, but they typically have a restriction on the number of entries that can be used in the LUT. A common way to implement a LUT in SIMD is to put the entire LUT in a SIMD register, which can be 128 bits. If every entry is eight bits, that means that there is only room for 16 entries in the LUT. When decoding or encoding a block, only one LUT row at a time needs to be used, since a block is restricted to a single qp. However, this still puts a restriction that every LUT row cannot have more than 16 non-zero items. There are work-arounds for using more than 16 entries. As an example, two LUT operations with two different registers can be used to obtain the equivalent of a look-up from 32 elements. However, this costs one valuable register and one extra instruction, making it harder to make the code run quickly. Hence for maximum throughput, a maximum of 16 entries should be used per row. This can be done by selecting a c-value for every qp value so that the resulting table has at most 16 non-zero values (or, alternatively, 15 non-zero values).

[0080] One downside of an embodiment using a c-value is that the act of fetching a look-up table value suddenly seems to have become more complex, since it involves a multiplication. As an example, this is the previous pseudo code for fetching the weight:

// done once per block:
lookupTablePtr = m_bilateralFilterTable[qp-18]; // point to the right LUT, e.g., LUT[x,34]
theMaxPos = maxPosList[qp-18]; // find where the zeros start

// done several times per pixel:
weight = lookupTablePtr[min(theMaxPos, abs(deltaI))];

[0081] Now however, the idea illustrated above instead gives the following pseudo code:

// constant variables since they never change
lookupTablePtr = m_singleTable; // always points to LUT[x,34]
theMaxPos = 99; // the place where the zeros start

// done once per block
c_value = m_c_valueTable[qp-18]; // get the correct c-value

// done several times per pixel:
weight = lookupTablePtr[min(theMaxPos, (abs(deltaI) * c_value) >> 4)];

[0082] For a software implementation, however, this increase in complexity is only superficial: a LUT of 2,202 bytes is typically not very big in software, so it is possible to go back to a 2D table using the following code at initialization time (i.e., only once when starting the software):

for (qp = 18; qp < 52; qp++) {
    c_value = m_c_valueTable[qp-18];
    for (dI = 0; dI < maxPosList[qp-18] + 1; dI++)
        m_bilateralFilterTable[qp-18][dI] = m_singleTable[min(theMaxPos, (abs(dI) * c_value) >> 4)];
}

[0083] Then the regular 2D-LUT software (first example) can be used instead. So embodiments carry no penalty for a software implementation, but can save significantly in terms of LUT space for hardware implementations.
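By way of a non-limiting sketch, the single-table lookup with a fixed-point c-value and its expansion into a 2D table may look as follows in Python. The names `exact_row`, `single_table`, `c_value` and `weight` are assumptions for this illustration, as is the clamping of the c-value to 8 bits: with qp = 34 as base, the c-value for qp = 18 would be 17, which does not fit four integer bits.

```python
import math

def exact_row(qp, max_delta=1023):
    """Directly tabulated row for 10-bit data (65 represents 1.0, sigma_d = 0.82)."""
    sigma_r = 2 * (qp - 17)
    return [int(math.floor(65.0 * math.exp(-1.0 / (2 * 0.82 ** 2)
                                           - (d * d) / (2.0 * sigma_r ** 2)) + 0.5))
            for d in range(max_delta + 1)]

BASE_QP = 34
base = exact_row(BASE_QP)
MAX_POS = base.index(0)              # index of the first zero in the base row
single_table = base[:MAX_POS + 1]    # the non-zero entries plus the first zero

def c_value(qp):
    """c = (BASE_QP - 17) / (qp - 17) in 4.4 fixed point, clamped to 8 bits (assumption)."""
    return min(255, int(round(16.0 * (BASE_QP - 17) / (qp - 17))))

def weight(qp, delta_i):
    """Per-pixel lookup: one multiplication, one shift, one clamp."""
    return single_table[min(MAX_POS, (abs(delta_i) * c_value(qp)) >> 4)]

# Expand once at start-up into a 2D table, as a software implementation might.
table_2d = {qp: [weight(qp, d) for d in range(1024)] for qp in range(18, 52)}
```

In this sketch the row for qp = 34 is reproduced exactly, and the row for qp = 51 uses every base entry twice, which corresponds to c = 1/2.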

[0084] This saved LUT space must be traded off against the accuracy of the c-values. In the case above, we use an 8-bit value for the c-value, and the largest non-zero value (for the highest qp of 51) is 197, which is also contained in 8 bits. Thus the multiplication can be handled using an 8-bit by 8-bit multiplier. This multiplication will also consume surface area. If we had chosen a significantly higher accuracy of the c-value, the area consumed by the multiplication would be larger than the area saved by reducing the number of LUT values. It is therefore important that the accuracy of the c-value is set sufficiently low so that a substantial reduction results.

[0085] In another embodiment of the third aspect, we avoid the multiplication by putting some restrictions on the c-values. To understand how to do this efficiently, we look at the optimal c-values for qp 18 through 51 for the case when LUT(33, ΔI) is the base table:

c18 = 16.000000
c19 = 8.000000
c20 = 5.333333
c21 = 4.000000
c22 = 3.200000
c23 = 2.666667
c24 = 2.285714
c25 = 2.000000
c26 = 1.777778
c27 = 1.600000
c28 = 1.454545
c29 = 1.333333
c30 = 1.230769
c31 = 1.142857
c32 = 1.066667
c33 = 1.000000
c34 = 0.941176
c35 = 0.888889
c36 = 0.842105
c37 = 0.800000
c38 = 0.761905
c39 = 0.727273
c40 = 0.695652
c41 = 0.666667
c42 = 0.640000
c43 = 0.615385
c44 = 0.592593
c45 = 0.571429
c46 = 0.551724
c47 = 0.533333
c48 = 0.516129
c49 = 0.500000
c50 = 0.484848
c51 = 0.470588

[0086] As can be seen in the table, quite a few of these (those for qp 18, 19, 21, 25, 33 and 49) are pure powers of two. This means that if we restrict the c-value to be a power of two, these will still be well represented. Furthermore, in the beginning of the list, even the ones that are not powers of two are often immediately next to one that is a pure power of two. As an example, the c-value for qp 20 is 5.333, which is not a pure power of two, but in this case it is likely OK to use the c-value for qp 21 (which is 4) or the c-value for qp 19 (which is 8) instead. The reason why this can work is that the mapping from qp value to σ_r is not something that is derived from some strict principle, but rather something that seems to work reasonably well. If something works reasonably well for qp 21 it may also work reasonably well for qp 20.
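The c-values listed above follow from the ratio (33 - 17)/(qp - 17). As an illustrative sketch only, the following Python fragment regenerates them and picks out the qp values whose c-value is an exact power of two; the helper names `optimal_c` and `is_power_of_two` are assumptions introduced for this example.

```python
import math

BASE_QP = 33

def optimal_c(qp, base_qp=BASE_QP):
    """c = (base_qp - 17) / (qp - 17) for the chosen base row."""
    return (base_qp - 17) / (qp - 17)

def is_power_of_two(x):
    """True if x == 2**k for some integer k (positive, zero or negative)."""
    mantissa, _ = math.frexp(x)    # x = mantissa * 2**exp with 0.5 <= mantissa < 1
    return mantissa == 0.5

c_values = {qp: optimal_c(qp) for qp in range(18, 52)}
exact_shift_qps = [qp for qp, c in c_values.items() if is_power_of_two(c)]
```

For these qp values the scaling by c could be replaced by a pure bit shift without any approximation.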

[0087] However, the lower half of the table is quite sparsely populated by c-values that are pure powers of two. Therefore it is likely going to hurt the performance somewhat. However, this can be mitigated by having not one, but two or more base tables. As an example, there is no power-of-two c-value between 34 and 47, but using a second base table at qp=41 would cut this long stretch in half. Also, it would cut the second longest stretch from 26 to 32 in half.

[0088] As an example, we have tested using two base tables LUT(32, x) and LUT(39, x) and c-values that are only powers of two. This gave a result where no significant BD-rate degradation could be measured for the filter. If this is done, c-values need to be stored differently, since they will no longer represent a multiplication but instead a bit shift. In the example above, bit shifts from -1 (division by two) to +4 (multiplication by 16) are used. This can be stored in 3 bits. If we have two base tables we also need to store an index to tell which base table to use for a given qp. We call this the base_index. When we have only two base tables, the base_index for each qp only needs one bit. Thus in total 4 bits of information need to be stored per qp, in total 4*34/8 = 17 bytes of information. Also, LUT(32, x) consists of 88 5-bit values, and LUT(39, x) consists of 128 5-bit values, giving another (88+128)*5/8 = 135 bytes. In total 152 bytes need to be stored compared to 2202 bytes in [1], a reduction by a factor of over 14.
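The storage figures above can be reproduced with a short calculation. The Python sketch below is illustrative only: it assumes the 10-bit tabulation with 65 representing 1.0 and σ_d = 0.82 described earlier, and the helper name `stored_length` is introduced here for the example.

```python
import math

def stored_length(qp):
    """Entries kept for LUT(qp, .): all non-zero weights plus the first zero."""
    sigma_r = 2 * (qp - 17)                  # 10-bit data
    d = 0
    while math.floor(65.0 * math.exp(-1.0 / (2 * 0.82 ** 2)
                                     - (d * d) / (2.0 * sigma_r ** 2)) + 0.5) > 0:
        d += 1
    return d + 1

NUM_QPS = 51 - 18 + 1                        # 34 qp values
per_qp_bits = 3 + 1                          # 3-bit shift plus 1-bit base_index
side_info_bytes = per_qp_bits * NUM_QPS // 8
table_bytes = (stored_length(32) + stored_length(39)) * 5 // 8   # 5-bit weights
total_bytes = side_info_bytes + table_bytes
```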

[0089] The pseudo-code to do the lookup would look something like this:

// constant variables since they never change
tablePtr[0] = m_baseTable32; // always points to LUT[x,32]
tablePtr[1] = m_baseTable39; // always points to LUT[x,39]
maxPos[2] = {88, 128}; // the place where the zeros start

// done once per block
c_shift = m_c_valueTable[qp-18]; // get the correct c-value
lookupTablePtr = tablePtr[base_index[qp-18]]; // get the correct base table
theMaxPos = maxPos[base_index[qp-18]]; // get the correct max-pos

// done several times per pixel:
weight = lookupTablePtr[min(theMaxPos, (abs(deltaI) << c_shift))];

[0090] Again, it should be noted that one does not need to implement it this way; just as before it is possible to expand this to one 2D-table and use the simpler code to access it. However, for a hardware implementation, this is very inexpensive to implement. In another embodiment we use four tables, for instance the tables for 50, 46, 42, and 38.
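A possible shift-only lookup with two base tables is sketched below in Python. This is an illustration under stated assumptions: the rule in `choose_base_and_shift` (pick the base row whose optimal c is closest, in log2, to a power of two) is invented for this example and is not the assignment used in the tests described above; a negative shift is realized as a right shift; and `max_pos` here holds the index of the first zero.

```python
import math

def exact_row(qp, max_delta=1023):
    """Directly tabulated row for 10-bit data (65 represents 1.0, sigma_d = 0.82)."""
    sigma_r = 2 * (qp - 17)
    return [int(math.floor(65.0 * math.exp(-1.0 / (2 * 0.82 ** 2)
                                           - (d * d) / (2.0 * sigma_r ** 2)) + 0.5))
            for d in range(max_delta + 1)]

BASE_QPS = (32, 39)
base_rows, max_pos = [], []
for b in BASE_QPS:
    row = exact_row(b)
    first_zero = row.index(0)
    base_rows.append(row[:first_zero + 1])   # non-zero entries plus the first zero
    max_pos.append(first_zero)

def choose_base_and_shift(qp):
    """Hypothetical rule: base row whose optimal c is nearest to a power of two."""
    best = None
    for idx, b in enumerate(BASE_QPS):
        log_c = math.log2((b - 17) / (qp - 17))
        shift = int(round(log_c))
        err = abs(log_c - shift)
        if best is None or err < best[0]:
            best = (err, idx, shift)
    return best[1], best[2]

def weight(qp, delta_i):
    idx, shift = choose_base_and_shift(qp)   # would be done once per block
    a = abs(delta_i)
    pos = a << shift if shift >= 0 else a >> -shift   # negative shift = division
    return base_rows[idx][min(max_pos[idx], pos)]
```

With this rule the shifts happen to span -1 to +4, matching the range mentioned above, and qp values whose optimal c is an exact power of two relative to one of the base rows (for example qp = 28 relative to qp = 39) are reproduced without error.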

[0091] In an alternative embodiment, one may be interested in reducing the maximum size of a LUT to a fixed number, such as 16, in order to facilitate efficient SIMD implementation. This can also be achieved using just shifts, since it is possible to use a smaller table and then do a right-shift to simulate c-values smaller than one. In this case, it is possible to use the following LUT rows, none of which has more than 16 entries:

LUT*(18, ΔI) = {255, 225, 155, 83, 35, 11, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0};
LUT*(19, ΔI) = {255, 247, 225, 192, 155, 117, 83, 55, 35, 20, 11, 6, 3, 1, 1, 0};
LUT*(20, ΔI) = {255, 246, 215, 167, 117, 73, 41, 21, 9, 4, 1, 0, 0, 0, 0, 0};
LUT*(21, ΔI) = {255, 250, 231, 201, 164, 126, 91, 62, 39, 23, 13, 7, 3, 2, 1, 0};
LUT*(22, ΔI) = {254, 239, 192, 132, 77, 39, 17, 6, 2, 1, 0, 0, 0, 0, 0, 0};
LUT*(23, ΔI) = {255, 243, 209, 161, 111, 69, 38, 19, 8, 3, 1, 0, 0, 0, 0, 0};
LUT*(24, ΔI) = {255, 246, 220, 182, 138, 97, 63, 37, 21, 10, 5, 2, 1, 0, 0, 0};
LUT*(25, ΔI) = {255, 248, 228, 197, 159, 121, 87, 58, 37, 22, 12, 6, 3, 1, 1, 0};
LUT*(26, ΔI) = {254, 232, 176, 109, 56, 24, 8, 2, 1, 0, 0, 0, 0, 0, 0, 0};
LUT*(27, ΔI) = {254, 236, 188, 128, 74, 37, 16, 6, 2, 1, 0, 0, 0, 0, 0, 0};
LUT*(28, ΔI) = {254, 239, 198, 144, 92, 51, 25, 11, 4, 1, 1, 0, 0, 0, 0, 0};
LUT*(29, ΔI) = {254, 242, 206, 158, 108, 66, 36, 18, 8, 3, 1, 1, 0, 0, 0, 0};
LUT*(30, ΔI) = {254, 244, 213, 169, 123, 81, 48, 26, 13, 6, 2, 1, 0, 0, 0, 0};
LUT*(31, ΔI) = {254, 245, 218, 179, 136, 95, 61, 36, 20, 10, 5, 2, 1, 0, 0, 0};
LUT*(32, ΔI) = {255, 246, 223, 187, 147, 107, 73, 46, 27, 15, 8, 4, 2, 1, 0, 0};
LUT*(33, ΔI) = {255, 247, 226, 194, 157, 119, 85, 57, 36, 21, 12, 6, 3, 1, 1, 0};
LUT*(34, ΔI) = {255, 248, 229, 201, 166, 130, 96, 68, 45, 28, 17, 9, 5, 3, 1, 1};
LUT*(35, ΔI) = {253, 231, 174, 107, 55, 23, 8, 2, 1, 0, 0, 0, 0, 0, 0, 0};
LUT*(36, ΔI) = {253, 233, 180, 117, 64, 29, 11, 4, 1, 0, 0, 0, 0, 0, 0, 0};
LUT*(37, ΔI) = {254, 235, 187, 126, 73, 36, 15, 5, 2, 1, 0, 0, 0, 0, 0, 0};
LUT*(38, ΔI) = {254, 237, 192, 135, 82, 43, 20, 8, 3, 1, 0, 0, 0, 0, 0, 0};
LUT*(39, ΔI) = {254, 239, 197, 143, 91, 50, 25, 11, 4, 1, 1, 0, 0, 0, 0, 0};
LUT*(40, ΔI) = {254, 240, 201, 150, 99, 58, 30, 14, 6, 2, 1, 0, 0, 0, 0, 0};
LUT*(41, ΔI) = {254, 241, 205, 156, 107, 65, 36, 18, 8, 3, 1, 1, 0, 0, 0, 0};
LUT*(42, ΔI) = {254, 242, 209, 162, 114, 73, 42, 22, 10, 4, 2, 1, 0, 0, 0, 0};
LUT*(43, ΔI) = {254, 243, 212, 168, 121, 80, 48, 26, 13, 6, 2, 1, 0, 0, 0, 0};
LUT*(44, ΔI) = {254, 244, 215, 173, 128, 87, 54, 31, 16, 8, 3, 1, 1, 0, 0, 0};
LUT*(45, ΔI) = {254, 245, 217, 178, 134, 93, 60, 35, 19, 10, 5, 2, 1, 0, 0, 0};
LUT*(46, ΔI) = {254, 245, 220, 182, 140, 100, 66, 41, 23, 12, 6, 3, 1, 1, 0, 0};
LUT*(47, ΔI) = {254, 246, 222, 186, 146, 106, 72, 46, 27, 15, 8, 4, 2, 1, 0, 0};
LUT*(48, ΔI) = {254, 247, 224, 190, 151, 112, 78, 51, 31, 18, 9, 5, 2, 1, 1, 0};
LUT*(49, ΔI) = {254, 247, 225, 193, 156, 118, 84, 56, 35, 21, 12, 6, 3, 1, 1, 0};
LUT*(50, ΔI) = {254, 247, 227, 197, 160, 124, 90, 61, 40, 24, 14, 8, 4, 2, 1, 1};
LUT*(51, ΔI) = {255, 248, 229, 200, 165, 129, 95, 67, 44, 28, 16, 9, 5, 2, 1, 1};

[0092] Now, if we want to calculate the LUT row for, say, qp=51, from these compressed LUT rows, we can do it through the equation:

$$LUT(51, \Delta I) = LUT^*\big(51, (\Delta I + \text{half\_val}(51)) \gg \text{shift\_val}(51)\big)$$

where half_val(51) = 8 and shift_val(51) = 4. This will give the same result as using the following LUT row:

LUT(51, ΔI) = {255, 255, 255, 255, 255, 255, 255, 255, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 229, 229, 229, 229, 229, 229, 229, 229, 229, 229,

229, 229, 229, 229, 229, 229, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,

200, 200, 200, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165,

129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 95, 95, 95, 95,

95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67,

67, 67, 67, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 28, 28, 28, 28, 28, 28,

28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,

16, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2,2,2, 2, 2,2,2, 2, 2, 2,2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,0,0, 0, 0,0, 0, 0, 0,0,0, 0, 0, };
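As an illustrative sketch only, the extraction of a weight from the compressed 16-entry row for qp = 51 can be written in Python as follows. The function name `lut51` is an assumption for this example, as is the choice of returning 0 for indices that fall beyond the 16 stored entries.

```python
LUT_STAR_51 = [255, 248, 229, 200, 165, 129, 95, 67, 44, 28, 16, 9, 5, 2, 1, 1]
HALF_VAL_51 = 8     # added before shifting, so that the index is rounded
SHIFT_VAL_51 = 4    # one stored entry covers 2**4 = 16 consecutive differences

def lut51(delta_i):
    """Weight for qp = 51 taken from the 16-entry compressed row."""
    idx = (abs(delta_i) + HALF_VAL_51) >> SHIFT_VAL_51
    # Assumption: differences mapping beyond the stored row get weight 0.
    return LUT_STAR_51[idx] if idx < len(LUT_STAR_51) else 0

expanded = [lut51(d) for d in range(256)]
```

The first entries of `expanded` are eight copies of 255 followed by sixteen copies of 248, in line with the start of the expanded row shown above.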

[0093] As can be seen, this function is step-wise constant. When multiplying this with the ΔI value, the result will be a function that is piecewise increasing (since ΔI is increasing but the LUT value is constant), followed by discontinuities when the LUT value goes down. In order to avoid this, it is possible to make sure that the value we multiply the LUT value with is also constant at the same time. This can be done by multiplying by

$$\big((\Delta I + \text{half\_val}(51)) \gg \text{shift\_val}(51)\big) \ll \text{shift\_val}(51)$$

instead of multiplying by ΔI. Hence, we can use

$$LUT(51, \Delta I) \cdot \Big[\big((\Delta I + \text{half\_val}(51)) \gg \text{shift\_val}(51)\big) \ll \text{shift\_val}(51)\Big]$$

instead of LUT(51, ΔI) * ΔI.

[0094] This example was how to obtain the LUT row for qp=51. To get the LUT rows for other qps, we could use the following values for half_val and shift_val:

half_val = {0, 0, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};

shift_val = {0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};

Note that, for the first two qps 18 and 19, we should not add anything (half_val = 0) and we should not shift any steps (shift_val = 0). Also, as can be seen, the last value of half_val, representing qp=51, is 8, just as in the example above, and the last value of shift_val, representing qp=51, is four, also in alignment with the example above.

[0095] In yet an alternative embodiment, it is possible to simplify the above by using LUT(51, ΔI) * ΔI and simply accept the fact that the resulting function is non-monotonic. In another embodiment, it may be possible to simplify the equations by setting half_val to zero for all qps; then the extraction of a value becomes

$$LUT(qp, \Delta I) = LUT^*\big(qp, \Delta I \gg \text{shift\_val}(qp)\big)$$

and the multiplication with ΔI can be, for example,

$$LUT(qp, \Delta I) \cdot \big[(\Delta I \gg \text{shift\_val}(qp)) \ll \text{shift\_val}(qp)\big]$$

if a monotonic function is needed, or LUT(51, ΔI) * ΔI if a non-monotonic function is acceptable.

[0096] So far we have looked at this third aspect through the lens of the implementation of [1]. However, it is easy to see that this third aspect is also applicable to the filter as defined in [2] when we have applied the first aspect (going from 3D-LUT to 2D-LUT) but not the second (avoiding multiplication). As is described in conjunction with Equation 22, we then store r_{σ_r}(ΔI) in the LUT. Expressing this in terms of qp gives

$$LUT(qp, \Delta I) = e^{-\frac{|\Delta I|^2}{2 \cdot 2^2 \cdot (qp - 17)^2}}$$

and it is easy to see that this is just the same as the expression in Equation 24, modulo a scaling factor of $65 \cdot e^{-\frac{1}{2 \cdot 0.82^2}}$. Hence the same math applies as to how to approximate one table row with the help of another using a c-value.

[0097] It is perhaps a bit trickier to see that this third aspect is also applicable when both the first and the second aspects are applied to [2]. In that case, we store the influence factor μ_{σ_r}(ΔI) as described in Equation 11. Expressing the LUT in terms of qp instead of σ_r gives

$$LUT(qp, \Delta I) = \Delta I \cdot e^{-\frac{|\Delta I|^2}{2 \cdot 2^2 \cdot (qp - 17)^2}}$$

[0098] Plotting the rows for this LUT for all 34 different qp’s gives the plot illustrated by FIG. 2. From this plot, it is clear that only scaling along the x-axis will not work, since the functions have different heights.

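[0099] However, the trick to multiply both the numerator and denominator in the exponent by $\frac{(51 - 17)^2}{(34 - 17)^2}$ shown in Equation 25 (repeated for convenience) works also here:

$$\frac{|\Delta I|^2 \cdot \frac{(51 - 17)^2}{(34 - 17)^2}}{2 \cdot 2^2 \cdot (34 - 17)^2 \cdot \frac{(51 - 17)^2}{(34 - 17)^2}} = \frac{\left|\Delta I \cdot \frac{51 - 17}{34 - 17}\right|^2}{2 \cdot 2^2 \cdot (51 - 17)^2} = \frac{|\Delta I \cdot c|^2}{2 \cdot 2^2 \cdot (51 - 17)^2} \qquad \text{(Eqn 25)}$$

Thus

$$LUT(34, \Delta I) = \Delta I \cdot e^{-\frac{|\Delta I|^2}{2 \cdot 2^2 \cdot (34 - 17)^2}} = \Delta I \cdot e^{-\frac{|\Delta I \cdot c|^2}{2 \cdot 2^2 \cdot (51 - 17)^2}}. \qquad \text{(Eqn 29)}$$

However, in this case, the right hand side does not equal LUT(51, ΔI * c), because that is equal to

$$LUT(51, \Delta I \cdot c) = \Delta I \cdot c \cdot e^{-\frac{|\Delta I \cdot c|^2}{2 \cdot 2^2 \cdot (51 - 17)^2}}.$$

Instead, we have that

$$LUT(34, \Delta I) = \frac{1}{c} \cdot LUT(51, \Delta I \cdot c).$$

[00100] Thus, instead of scaling only in the x-direction, we need to scale both in the x and y direction with c and 1/c respectively. If we use an arbitrary number c, we will need to store also 1/c so that we can multiply the result by that factor afterwards. The source code could look something like this: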

// constant variables since they never change
lookupTablePtr = m_singleTable; // always points to LUT[x,51]
theMaxPos = 99; // the place where the zeros start

// done once per block
c_value = m_c_valueTable[qp-18]; // get the correct c-value
inv_c_value = m_inv_c_valueTable[qp-18]; // get the correct 1/c value

// done several times per pixel:
weight = (inv_c_value * lookupTablePtr[min(theMaxPos, (abs(deltaI) * c_value) >> 4)]) >> 4;

[00101] Note that for a software implementation this could again be efficiently implemented by expanding the result back into a table. For hardware though, this saves valuable LUT space. A drawback however with this solution is that it incurs an extra multiplication again.

[00102] Luckily, dividing by c is easy if c is a power of 2; if c = 2^k, dividing by c is roughly equivalent to just shifting k bits. Hence in one embodiment of the third aspect, only powers of two are used, and the following pseudo code can be used to calculate the weight:

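// constant variables since they never change
tablePtr[0] = m_baseTable32; // always points to LUT[x,32]
tablePtr[1] = m_baseTable39; // always points to LUT[x,39]
maxPos[2] = {88, 128}; // the place where the zeros start

// done once per block
c_shift = m_c_valueTable[qp-18]; // get the correct c-value
lookupTablePtr = tablePtr[base_index[qp-18]]; // get the correct base table
theMaxPos = maxPos[base_index[qp-18]]; // get the correct max-pos

// done several times per pixel:
weight = lookupTablePtr[min(theMaxPos, (abs(deltaI) << c_shift))] >> c_shift;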

[00103] The pseudo code above uses two table rows as base tables (32 and 39), but it is possible to use any number of rows as base table rows, including using four rows (e.g., 50, 46, 42, and 38) as base rows. Again, a software version can be much more efficiently rewritten. For a hardware implementation though this is valuable. This combines all three aspects; reducing the size of the LUT by reducing it to two dimensions, avoiding a multiplication, and finally reusing the rows of the LUT by scaling.
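The need to scale in both directions when the influence function is tabulated can be checked numerically. The following Python sketch is illustrative only; it works in floating point without any rounding of table entries, and the function names `influence` and `influence_from_base` are assumptions introduced for this example.

```python
import math

def influence(qp, delta_i):
    """mu(dI) = dI * exp(-dI^2 / (2 * sigma_r^2)), sigma_r = 2 * (qp - 17) (10-bit data)."""
    sigma_r = 2.0 * (qp - 17)
    return delta_i * math.exp(-(delta_i * delta_i) / (2.0 * sigma_r * sigma_r))

def influence_from_base(qp, delta_i, base_qp=51):
    """Reuse the base row: scale the x-axis by c and the y-axis by 1/c."""
    c = (base_qp - 17) / (qp - 17)
    return influence(base_qp, delta_i * c) / c
```

When c is a power of two, both the multiplication by c and the division by c reduce to bit shifts, as in the pseudo code above. The sketch also exhibits the odd symmetry of the influence function, so a table for non-negative differences suffices.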

[00104] It should be noted that when tabulating the influence function

$$\mu_{\sigma_r}(\Delta I) = \Delta I \cdot e^{-\frac{\Delta I^2}{2\sigma_r^2}}$$

it is no longer true that LUT(qp, -ΔI) = LUT(qp, ΔI). This is because of the multiplication by ΔI, which changes the sign. However, it is instead true that LUT(qp, -ΔI) = -LUT(qp, ΔI), so it is still possible to only tabulate for positive ΔI values. For negative ΔI values we simply first negate, then fetch the value from the LUT, and then negate the fetched value.

[00105] LUT Instantiations

[00106] Here we go deeper into how many instantiations of the LUT we need in order to filter a single pixel. As an example, consider the filtering from [1]. It filters a pixel using

$$I_F = I_C + \frac{w_A \Delta I_A + w_B \Delta I_B + w_L \Delta I_L + w_R \Delta I_R}{1 + w_A + w_B + w_L + w_R} \qquad \text{(Eqn 1)}$$

where I_F is the filtered pixel intensity and I_C is the center pixel intensity, i.e., the pixel intensity before filtering. The value ΔI_A is the difference between the center pixel intensity I_C and the intensity of the pixel above I_A, ΔI_A = I_A - I_C. Analogously, I_B, I_L and I_R are the intensities of the pixels immediately below, left and right of the center pixel respectively, and ΔI_B = I_B - I_C, ΔI_L = I_L - I_C, ΔI_R = I_R - I_C.

[00107] In this case, the weights w_A, w_B, w_L and w_R are potentially all different, and they depend on the intensity differences ΔI_A, ΔI_B, ΔI_L and ΔI_R. As an example, w_A is fetched from the LUT using w_A = LUT(qp, |ΔI_A|). However, w_B is fetched from a different part of the LUT using w_B = LUT(qp, |ΔI_B|). Typically, a LUT implementation cannot read out two different values at once. Hence in this case we would need two LUT instantiations in order to be able to read out w_A and w_B in parallel. For the same reason, to get all weights we would need four LUT instantiations.

[00108] In the following we describe how it is possible to get away with using only seven instantiations of a LUT when filtering a row or a column of four pixels in parallel.

[00109] Assume we need to filter the right-most column in the following 4x4 block.

Here we have denoted the top right pixel with “T”, the pixel to its left with “S”, etc. The intensity of pixel T is denoted I_T, the intensity of pixel S is denoted I_S, etc.

[00110] To filter pixel T we first calculate the intensity differences between T and its neighboring pixels S and V: ΔI_TS = I_S - I_T and ΔI_TV = I_V - I_T. We can now use |ΔI_TS| to get the weight w_TS = LUT(qp, |ΔI_TS|) and likewise for the weight w_TV = LUT(qp, |ΔI_TV|). The filtered pixel T can now be calculated using

where w_c is a center weight that is constant for the block and therefore does not need to be looked up.

[00111] Next we filter pixel V. To do so, we need to calculate the weights for the three surrounding pixels T, U and X. We start with the above pixel, ΔI_VT = I_T - I_V, and we can now calculate the weight w_VT = LUT(qp, |ΔI_VT|). However, since ΔI_VT = I_T - I_V = -(I_V - I_T) = -ΔI_TV, this means that |ΔI_VT| = |ΔI_TV|, and hence w_VT = LUT(qp, |ΔI_VT|) = LUT(qp, |ΔI_TV|) = w_TV, which we already looked up above. Hence we can reuse w_TV and we don’t need another instantiation of the LUT. For the other two weights w_VU and w_VX we need instantiations of the LUT, bringing the number to four so far.

[00112] Likewise, when filtering pixel X, we can use w_XV = w_VX and only need two more instantiations for w_XW and w_XZ, bringing the total so far to six.

[00113] For the last pixel, we can use w_ZX = w_XZ and we only need one more instantiation for w_ZY, bringing the total number of LUT instantiations needed to 7.

[00114] It should also be noted that the product w_VT ΔI_VT = w_TV (-ΔI_TV) = -w_TV ΔI_TV. Hence it is possible to save not only the look-up of w_VT but it is also possible to avoid the multiplication w_VT ΔI_VT and instead replace it with a negation of the previously calculated value w_TV ΔI_TV. This means that to filter the four pixels T, V, X and Z, a total of seven multipliers must be used if using the method described in [2]. If only one multiplication is needed per filtered pixel, as when using the second aspect, only four multipliers would be needed.
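The count of seven can be reproduced by enumerating the distinct neighbor pairs. The Python sketch below is illustrative only; the function name `count_lut_instantiations` is an assumption, and, as in the example above, neighbors outside the 4x4 block are treated as unavailable.

```python
def count_lut_instantiations(rows=4, cols=4, column=3):
    """Count the distinct neighbor pairs needed to filter one column of a block."""
    pairs = set()
    for r in range(rows):
        center = (r, column)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, column + dc
            if 0 <= nr < rows and 0 <= nc < cols:         # neighbor inside the block
                # Unordered pair: the weight for (X, Y) equals the weight for (Y, X).
                pairs.add(frozenset((center, (nr, nc))))
    return len(pairs)
```

For the right-most column this gives three vertical pairs (T-V, V-X, X-Z) and four horizontal pairs (T-S, V-U, X-W, Z-Y).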

[00115] Seen another way, even in the case of [1] and [2], it is possible to save some multiplications. This is due to the fact that the difference between a center pixel and its right neighbor can be reused when the right neighbor in the next step is the center pixel. In detail, if I(34,40) is the intensity of the pixel at x = 34, y = 40, and I(35,40) is the intensity of the pixel immediately to the right, then ΔI_R(34,40) = (I(35,40) - I(34,40)) when the center pixel is in position (34,40). When the center pixel is instead in position (35,40), ΔI_L(35,40) = (I(34,40) - I(35,40)) = -ΔI_R(34,40). A similar reduction can also be made with the top and bottom pixels; ΔI_T(34,41) = (I(34,40) - I(34,41)) = -ΔI_B(34,40). However, this only saves two multiplications per pixel, resulting in two remaining multiplications per pixel. In contrast, embodiments disclosed herein allow the same filtered pixel value to be calculated using only one multiplication per pixel.

[00116] FIG. 3 illustrates a process 300 of applying bilateral filtering to a media object comprising a plurality of samples. For each sample C of the plurality of samples in the media object (step 302), it is determined if one or more neighbors of sample C that are above (A), below (B), to the left (L) or to the right (R) of sample C are available (step 304). As explained above, in some embodiments, the sample being filtered will always have access to all its surrounding samples at the time of filtering. However, in other embodiments, this is not always the case. As an example, assume we are filtering a block, and the block to the right has not yet been decoded. This means that a pixel situated on the right edge of the block will not have access to its right neighbor, since this neighbor belongs to the not-yet-decoded block. In this case the filtering will have to make do with fewer surrounding samples. If all samples are available (as determined at step 306), then the filtered sample for C is computed using the available samples A, B, L, and R (step 308). If not all samples are available (as determined at step 306), then the filtered sample for C is computed using a subset of A, B, L, and R for the samples that are available (step 310). For example, step 310 may use d_edge(σ_d) when filtering edge pixels and d_corner(σ_d) when filtering corner pixels, whereas step 308 may use d(σ_d).

[00117] FIG. 4 illustrates a process 400 of applying bilateral filtering to a media object comprising a plurality of samples. The method includes, for a current sample C, computing a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left ( L ), and to the right (R) of the current sample C (step 402). The filtered sample value I_F is given by the equation

where:

I_C is the current sample intensity before filtering;

ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A - I_C;

ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B - I_C;

ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L - I_C;

ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R - I_C;

σ_d is a spatial strength parameter;

σ_r is an intensity strength parameter; and

d(σ_d) is given by

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}.$$

Computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions (step 404).

[00118] In embodiments, the lookup table is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In embodiments, the lookup table is used to determine $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In embodiments, the lookup table is used to compute an influence function μ_{σ_r}(ΔI), where

$$\mu_{\sigma_r}(\Delta I) = \Delta I \cdot e^{-\frac{\Delta I^2}{2\sigma_r^2}}$$

such that the filtered sample value I_F is given by the equation

[00119] In embodiments, the lookup table is one-dimensional and depends on ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In embodiments, the lookup table is created based on a fixed value of σ_r (σ_r0) and when computing the filtered sample value I_F for a different value of σ_r (σ_r1), a scaling transform s(σ_r1, ΔI) is applied using a constant c determined from σ_r0 and σ_r1. In embodiments, the filtered sample value I_F is approximated using fixed point numbers. In embodiments, I_F is approximated by Î_F, which is given by

wherein

where >> denotes arithmetic right shift and round(·) rounds to the nearest integer.
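The one-dimensional table and the fixed-point approximation can be illustrated together. The sketch below is only an illustration under several assumptions. It builds an integer table for a reference strength σ_r0 and reuses it for another strength σ_r1 by looking up the scaled difference c·|ΔI| with c = σ_r0/σ_r1. This is one possible scaling transform, chosen because exp(−ΔI²/(2σ_r1²)) = exp(−(cΔI)²/(2σ_r0²)). The final normalization is an arithmetic right shift with rounding. The names `SHIFT`, `MAX_DIFF`, `build_lut_1d`, `scaled_weight`, `round_shift`, and `filter_fixed` are hypothetical, as is the filter form I_C + d·Σ ΔI·weight(ΔI).

```python
import math

SHIFT = 8        # assumed number of fractional bits
MAX_DIFF = 1023  # assumed 10-bit samples

def build_lut_1d(sigma_r0):
    """Integer weights round(2^SHIFT * exp(-dI^2 / (2 sigma_r0^2))), indexed by |dI|."""
    return [round((1 << SHIFT) * math.exp(-(di * di) / (2.0 * sigma_r0 ** 2)))
            for di in range(MAX_DIFF + 1)]

def scaled_weight(lut, c, delta):
    """Look up the table built for sigma_r0 at the scaled difference c * |delta|."""
    idx = min(MAX_DIFF, round(c * abs(delta)))
    return lut[idx]

def round_shift(x, s):
    """Arithmetic right shift by s bits after adding half of 2^s."""
    return (x + (1 << (s - 1))) >> s

def filter_fixed(i_c, neighbors, lut, c, d_fixed):
    """Integer-only version of the assumed form I_C + d * sum(delta * weight)."""
    acc = sum((n - i_c) * scaled_weight(lut, c, n - i_c) for n in neighbors)
    # d_fixed and the weights each carry SHIFT fractional bits.
    return i_c + round_shift(d_fixed * acc, 2 * SHIFT)

sigma_r0, sigma_r1 = 8.0, 16.0
lut = build_lut_1d(sigma_r0)
c = sigma_r0 / sigma_r1
d_fixed = round((1 << SHIFT) * math.exp(-0.5) / (1.0 + 4.0 * math.exp(-0.5)))  # sigma_d = 1
print(filter_fixed(100, [104, 98, 101, 160], lut, c, d_fixed))
```

Python's `>>` on negative integers is an arithmetic shift, so `round_shift` behaves the same way for negative accumulators as the shift described in the text.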

[00120] In embodiments, I_F is approximated by a value that is given by

and

[00121] In embodiments, said bilateral filtering is applied during encoding and/or decoding of the media object. In embodiments, using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations, and in some embodiments, a size of a row of the lookup table is no more than 128 bits.

[00122] FIG. 5 is a diagram showing functional units of node 502 (e.g., an encoder/decoder) for applying bilateral filtering to a media object comprising a plurality of samples, according to an embodiment. Node 502 includes a computing unit 504. Computing unit 504 is configured to, for a current sample C, compute a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C,

wherein the filtered sample value I_F is given by the equation

where:

I_C is the current sample intensity before filtering;

ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A − I_C;

ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B − I_C;

ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L − I_C;

ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R − I_C;

σ_d is a spatial strength parameter;

σ_r is an intensity strength parameter; and

d(σ_d) is given by

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}$$

Computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.

[00123] FIG. 6 is a block diagram of node 502 (e.g., an encoder/decoder) for applying bilateral filtering to a media object comprising a plurality of samples, according to some embodiments. As shown in FIG. 6, node 502 may comprise: processing circuitry (PC) 602, which may include one or more processors (P) 655 (e.g., a general-purpose microprocessor and/or one or more other processors, such as an application-specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); a network interface 648 comprising a transmitter (Tx) 645 and a receiver (Rx) 647 for enabling node 502 to transmit data to and receive data from other nodes connected to a network 610 (e.g., an Internet Protocol (IP) network) to which network interface 648 is connected; and a local storage unit (a.k.a., "data storage system") 608, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 602 includes a programmable processor, a computer program product (CPP) 641 may be provided. CPP 641 includes a computer readable medium (CRM) 642 storing a computer program (CP) 643 comprising computer readable instructions (CRI) 644. CRM 642 may be a non-transitory computer readable medium, such as magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 644 of computer program 643 is configured such that, when executed by PC 602, the CRI causes node 502 to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, node 502 may be configured to perform steps described herein without the need for code. That is, for example, PC 602 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.

[00124] While various embodiments of the present disclosure are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

[00125] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.

Claims

CLAIMS:

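1. A method for applying bilateral filtering to a media object comprising a plurality of samples, the method comprising: for a current sample C, computing a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C,

wherein the filtered sample value I_F is given by the equation

where:

I_C is the current sample intensity before filtering;

ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A − I_C;

ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B − I_C;

ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L − I_C;

ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R − I_C;

σ_d is a spatial strength parameter;

σ_r is an intensity strength parameter; and

d(σ_d) is given by

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}$$

and wherein computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.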

2. The method of claim 1, wherein the lookup table is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.

3. The method of any one of claims 1-2, wherein the lookup table is used to determine $e^{-\Delta I^2/(2\sigma_r^2)}$, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.

4. The method of any one of claims 1-2, wherein the lookup table is used to compute an influence function μ_{σ_r}(ΔI), where

such that the filtered sample value I_F is given by the equation

5. The method of any one of claims 1, 3, and 4, wherein the lookup table is one-dimensional and depends on ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.

6. The method of claim 5, wherein the lookup table is created based on a fixed value of σ_r (σ_r0) and, when computing the filtered sample value I_F for a different value of σ_r (σ_r1), a scaling transform s(σ_r1, ΔI) is applied using a constant c determined from σ_r0 and σ_r1.

7. The method of any one of claims 1-6, wherein the filtered sample value I_F is approximated using fixed point numbers.

8. The method of claim 6, wherein I_F is approximated by a value that is given by

and

9. The method of any one of claims 1-8, wherein using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations.

10. The method of claim 9, wherein a size of a row of the lookup table is no more than 128 bits.

11. An encoder (502) for applying bilateral filtering to a media object comprising a plurality of samples, the encoder (502) comprising: a computing unit (504) configured to, for a current sample C, compute a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C,

wherein the filtered sample value I_F is given by the equation

where:

I_c is the current sample intensity before filtering,

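ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A − I_C;

ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B − I_C;

ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L − I_C;

ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R − I_C;

σ_d is a spatial strength parameter;

σ_r is an intensity strength parameter; and

d(σ_d) is given by

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}$$

and wherein computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.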

12. The encoder (502) of claim 11, wherein the lookup table is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.

13. The encoder (502) of any one of claims 11-12, wherein the lookup table is used to determine $e^{-\Delta I^2/(2\sigma_r^2)}$, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.

14. The encoder (502) of any one of claims 11-12, wherein the lookup table is used to compute an influence function μ_{σ_r}(ΔI), where

such that the filtered sample value I_F is given by the equation

15. The encoder (502) of any one of claims 11-14, wherein said computing unit is implemented in an Application-Specific Integrated Circuit (ASIC).

16. The encoder (502) of any one of claims 11-14, wherein using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations.

17. A decoder (502) for applying bilateral filtering to a media object comprising a plurality of samples, the decoder comprising: a computing unit (504) configured to, for a current sample C, compute a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C,

wherein the filtered sample value I_F is given by the equation

where:

I_C is the current sample intensity before filtering;

ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A − I_C;

ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B − I_C;

ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L − I_C;

ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R − I_C;

σ_d is a spatial strength parameter;

σ_r is an intensity strength parameter; and

d(σ_d) is given by

$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}$$

and wherein computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.

18. The decoder (502) of claim 17, wherein the lookup table is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.

19. The decoder (502) of any one of claims 17-18, wherein the lookup table is used to determine $e^{-\Delta I^2/(2\sigma_r^2)}$, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.

20. The decoder (502) of any one of claims 17-18, wherein the lookup table is used to compute an influence function μ_{σ_r}(ΔI), where

such that the filtered sample value I_F is given by the equation

21. The decoder (502) of any one of claims 17-20, wherein said computing unit is implemented in an Application-Specific Integrated Circuit (ASIC).

22. The decoder (502) of any one of claims 17-20, wherein using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations.

23. A computer program comprising instructions which, when executed by processing circuitry (602) of a node (502), cause the node (502) to perform the method of any one of claims 1-10.

24. A carrier containing the computer program of claim 23, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.