WO2020053262A1

WO2020053262A1 - Hadamard piecewise linear approximation

Info

Publication number: WO2020053262A1
Application number: PCT/EP2019/074207
Authority: WO
Inventors: Jacob STRÖM; Per Wennersten; Jack ENHORN; Du LIU
Original assignee: Telefonaktiebolaget Lm Ericsson (Publ)
Priority date: 2018-09-13
Filing date: 2019-09-11
Publication date: 2020-03-19

Abstract

There are provided mechanisms for filtering of a sample. The method comprises obtaining a quantization parameter qp associated with said sample. The method comprises generating transform coefficients by applying a Hadamard transform to an area comprising said sample and at least one sample surrounding said sample. The method further comprises obtaining, based on qp, a filtered transform coefficient from a transform coefficient x using a piecewise linear function y with n ≥ 2 pieces. The method comprises generating transformed samples by applying an inverse Hadamard transform on the filtered transform coefficients. The method comprises obtaining a filtered version of said sample based on at least one of said transformed samples.

Description

HADAMARD PIECEWISE LINEAR APPROXIMATION

TECHNICAL FIELD

[001] Disclosed are embodiments related to video compression and filtering.

BACKGROUND

[002] A Hadamard transform domain filter (HTDF) has been proposed in Stepin et al. [1] as a filtering step in video encoding and decoding. HTDF is proposed to be applied to the reconstructed video data to reduce noise. It therefore occupies the same place in the video decoding chain as the bilateral filter described by Wennersten et al. [2] did in the joint exploration model, JEM, and it is proposed to replace, rather than be used in conjunction with, the bilateral filter.

[003] First, the HTDF uses the Hadamard transform to convert the pixels into the transform domain. To filter a pixel intensity value, also known as a sample value, i₀, the surrounding intensity values i₁, i₂ and i₃ are also used.

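[004] The Hadamard transform coefficients are then calculated as:

$$
\begin{aligned}
y_0 &= i_0 + i_2 \\
y_1 &= i_1 + i_3 \\
y_2 &= i_0 - i_2 \\
y_3 &= i_1 - i_3 \\
R_0 &= y_0 + y_1 \\
R_1 &= y_0 - y_1 \\
R_2 &= y_2 + y_3 \\
R_3 &= y_2 - y_3
\end{aligned}
\qquad \text{(Equation 0)}
$$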
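The butterfly structure of Equation 0 can be sketched in a few lines of Python. The function names `hadamard4` and `inverse_hadamard4` are hypothetical, and the inverse shown here (re-applying the same butterflies and dividing by 4) is an assumption about normalization that the text does not spell out.

```python
def hadamard4(i0, i1, i2, i3):
    """Forward 4-point Hadamard transform, following Equation 0."""
    y0 = i0 + i2
    y1 = i1 + i3
    y2 = i0 - i2
    y3 = i1 - i3
    return (y0 + y1, y0 - y1, y2 + y3, y2 - y3)


def inverse_hadamard4(r0, r1, r2, r3):
    """Inverse transform (assumed normalization: divide by 4).

    Applying the same butterflies to (R0..R3) yields 4*i0, 4*i1, 4*i2, 4*i3,
    so an exact integer division by 4 recovers unfiltered samples.
    """
    s0, s1, s2, s3 = hadamard4(r0, r1, r2, r3)
    return (s0 // 4, s1 // 4, s2 // 4, s3 // 4)


coeffs = hadamard4(1, 2, 3, 4)
restored = inverse_hadamard4(*coeffs)
```

With unfiltered coefficients the round trip is lossless; once the coefficients have been filtered, the division would in general need a rounding rule, which is left open here.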

[005] Then, the HTDF filters the transform coefficients. Finally, the HTDF transforms the filtered coefficients back to the pixel domain using an inverse Hadamard transform. It is shown in [1] that HTDF provides a 0.50% bitrate saving at the cost of a complexity increase of 5% (encoding) and 4% (decoding) for random access compared to VTM 1.0.

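[006] The implementation in [1] uses a look-up table (LUT) to store the filtering results. The LUT described in [1] filters a pixel according to the following equation:

$$
F(i, \sigma) =
\begin{cases}
R(i), & \mathrm{Abs}(R(i)) > THR \\
LUT(R(i), \sigma), & R(i) > 0 \\
-LUT(R(i), \sigma), & R(i) \le 0
\end{cases}
\qquad \text{(Equation 1)}
$$

where

$$
LUT(R(i), \sigma) = \frac{R(i)^3}{R(i)^2 + \sigma^2} \qquad \text{(Equation 2)}
$$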

Here, R(i) is the spectrum component of the Hadamard transform domain, i.e., R(0) should be identified with R₀ above, R(1) with R₁, etc., of Equation 0. In some instances, the threshold (“THR”) is set to 128, and σ may be provided as one of the following:

$$
\sigma = 2^{(1 + 0.126 \cdot (qp - 27))} \qquad \text{(Equation 3)}
$$

or

$$
\sigma = 2 \cdot 2.64 \cdot 2^{0.1269 \cdot (qp - 11)} \qquad \text{(Equation 4)}
$$

However, since using Equation 1 would change the sign for negative values of R(i), an improved version of Equation 1 can be provided as shown below:

$$
F(i, \sigma) =
\begin{cases}
R(i), & \mathrm{Abs}(R(i)) \ge THR \\
LUT(R(i), \sigma), & R(i) > 0 \\
-LUT(-R(i), \sigma), & R(i) \le 0
\end{cases}
\qquad \text{(Equation 5)}
$$

[007] In addition to introducing the inner minus sign in −LUT(−R(i), σ), Equation 5 also changes the place where the threshold occurs, i.e., the top line uses Abs(R(i)) ≥ THR rather than Abs(R(i)) > THR as in Equation 1. The reason for this is that if THR is a power of two, such as THR = 128 as in [1], using Equation 1 would make it necessary to store 129 values (0, 1, 2, ..., 128) in the LUT. Normally, it is desirable to have a power-of-two number of items in a LUT, and this can be achieved by using Abs(R(i)) ≥ THR.

[008] The LUT in [1] has two dimensions, where one dimension corresponds to different qp values and the other dimension corresponds to different transform coefficient values. As an example, there may be one row of the LUT for each qp value, and each row may contain THR values mapping an unfiltered value to a filtered value. The qp values range from 18 to 63, and the filtering is applied to values in [0, 127]. Accordingly, the LUT may consist of 46 rows with 128 values in each row.
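The sign handling and the Abs(R(i)) ≥ THR test described above can be sketched as follows. This is a minimal illustration rather than the implementation in [1]: `filter_coefficient` is a hypothetical name, and `lut_row` stands for one qp row holding THR non-negative entries.

```python
def filter_coefficient(r, lut_row, thr=128):
    """Filter one transform coefficient with a sign-symmetric LUT row.

    lut_row is assumed to hold exactly `thr` entries, indexed 0 .. thr-1,
    so the >= comparison keeps the table size at a power of two.
    """
    if abs(r) >= thr:
        return r                 # large coefficients pass through unchanged
    if r > 0:
        return lut_row[r]
    return -lut_row[-r]          # inner minus keeps the index non-negative


# Hypothetical row used only to exercise the function.
toy_row = [max(x - 10, 0) for x in range(128)]
```

Because the table is only ever indexed with a non-negative value, one row per qp serves both signs of the coefficient.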

[009] For example, for qp = 37, the LUT-row used for filtering may be as follows:

LUT37 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 10, 11, 12, 13, 14, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 55, 56, 57, 58, 59, 60, 61, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 99, 101, 102, 103, 104, 105, 106, 107, 108]

SUMMARY

[0010] In [1], the Hadamard LUT has a size of 46x128=5888 entries, where 46 is the number of different qp values and 128 is the number of different transform coefficient values. In [1] each LUT entry is of an int32 type, meaning that it requires four bytes. The total number of bytes needed for the LUT therefore becomes 23552 bytes. However, even if a single byte is used for each entry, this still amounts to 5888 bytes. For full custom ASIC implementations, many copies of the Hadamard filter may be needed in order to increase parallelism. Firstly, one needs to filter four coefficients per pixel, and this means four instantiations of the LUT if one wants to do this in parallel, meaning 23552 bytes. Furthermore, if eight pixels need to be filtered in parallel, this would amount to 23552*8 = 188416 bytes. This is costly to implement. It is therefore of interest to reduce the complexity of the filter in terms of LUT size.
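The storage figures quoted in this paragraph follow from simple arithmetic, reproduced below as a sanity check. The variable names are illustrative only.

```python
QP_MIN, QP_MAX = 18, 63
COEFF_VALUES = 128                      # transform coefficient values 0..127

rows = QP_MAX - QP_MIN + 1              # one LUT row per qp value
entries = rows * COEFF_VALUES           # total LUT entries

bytes_int32 = entries * 4               # int32 entries as in [1]
bytes_one_byte = entries * 1            # if a single byte per entry is used

# Parallel hardware: one LUT copy per coefficient, then per pixel.
bytes_four_coefficients = bytes_one_byte * 4
bytes_eight_pixels = bytes_four_coefficients * 8
```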

[0011] Another problem is that the filter given in Equations 1 and 5 is discontinuous at the THR threshold point. For example, assuming a THR value of 128, this would mean that when R(i) < 128, the filtered coefficient is smaller than R(i). However, for R(i) ≥ 128, the output is always R(i). Exactly before the discontinuity, at R(i) = 127, we have

$$
F(i, \sigma) = \frac{(127)^3}{(127)^2 + \sigma^2}
$$

whereas right after the discontinuity, at R(i) = 128, we have F(i, σ) = R(i) = 128. For larger qp values, the gap between (127)³/((127)² + σ²) and 128 gets larger. This could introduce a discontinuity in the pixel domain and the effect of the discontinuity may be visible.
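The size of the jump at the threshold can be evaluated directly from the expression above. In the sketch below, `gap_at_threshold` is a hypothetical helper and the σ values are arbitrary illustrations, not values taken from [1].

```python
def gap_at_threshold(sigma, thr=128):
    """Jump between the filtered value at thr-1 and the pass-through value at thr."""
    x = thr - 1
    filtered_below = x ** 3 / (x ** 2 + sigma ** 2)
    return thr - filtered_below


# Arbitrary, increasing sigma values standing in for increasing qp.
gaps = [gap_at_threshold(s) for s in (0.0, 20.0, 50.0, 200.0)]
```

With σ = 0 the gap is exactly one step (127 to 128), i.e., no discontinuity beyond the sample spacing; the gap then grows monotonically with σ.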

[0012] A third problem is the obstacle to an efficient software implementation. It is known to a person skilled in the art that, for code to be able to run efficiently on a CPU, it is important to make it compatible with the single instruction, multiple data (“SIMD”) instructions that are available on modern CPUs. The reason is that a SIMD instruction can carry out many parallel operations in a single instruction. As an example, a regular CPU instruction may be able to add two numbers together in one clock cycle. In contrast, a SIMD instruction may be able to add eight numbers to eight other numbers in one clock cycle. This means that it may be possible to make the code run eight times faster.

[0013] SIMD instructions, however, are not particularly well suited for look-up table operations, at least not if the look-up table is large. If the look-up table is small enough to fit in a single SIMD register, then efficient LUT operations are possible. As an example, if the SIMD registers are 128 bits wide, then it is possible to fit 16 eight-bit numbers in a SIMD register. It may then be possible to do, say, eight parallel look-ups from this small look-up table. However, if the look-up table is 32 eight-bit numbers, two such SIMD operations may be needed. If the table length is 128 eight-bit numbers, eight such SIMD operations may be needed. But if we need eight instructions to carry out eight parallel look-ups, we may not have gained much compared to carrying out eight regular (i.e., non-SIMD) instructions, each of which can perform a look-up in a table of 128 numbers. It may therefore not be possible to speed up the code using SIMD operations.

[0014] Certain aspects of the present disclosure and their embodiments may provide solutions to the aforementioned problems. One aspect of the proposed solution is to have a significant reduction of items that have to be stored. Another one is a highly efficient SIMD implementation.

[0015] The proposed solutions disclosed herein for at least the problems noted above approximate the filtering equation that is currently implemented by a LUT in [1] by qp-dependent piecewise linear functions, such that the lookup table can be reduced significantly or even removed completely.

[0016] According to a first aspect of the embodiments, there is provided a method for filtering of a sample. The method comprises obtaining a quantization parameter qp associated with said sample. The method comprises generating transform coefficients by applying a Hadamard transform to an area comprising said sample and at least one sample surrounding said sample. The method further comprises obtaining, based on qp, a filtered transform coefficient from a transform coefficient x using a piecewise linear function y with n ≥ 2 pieces. The method comprises generating transformed samples by applying an inverse Hadamard transform on the filtered transform coefficients. The method comprises obtaining a filtered version of said sample based on at least one of said transformed samples.

[0017] According to a second aspect of the embodiments, there is provided a node (encoder or decoder) for filtering of a sample. The node comprises processing means operable to obtain a quantization parameter qp associated with said sample. The node comprises processing means operable to generate transform coefficients by applying a Hadamard transform to an area comprising said sample and at least one sample surrounding said sample. The node comprises processing means operable to obtain, based on qp, a filtered transform coefficient from a transform coefficient x using a piecewise linear function y with n ≥ 2 pieces. The node comprises processing means operable to generate transformed samples by applying an inverse Hadamard transform on the filtered transform coefficients. The node comprises processing means operable to obtain a filtered version of said sample based on at least one of said transformed samples.

[0018] According to a third aspect of the embodiments, there is provided a computer program for filtering of a sample. The computer program comprises code means which, when run on a computer, causes the computer to obtain a quantization parameter qp associated with said sample. The computer program comprises code means which, when run on a computer, causes the computer to generate transform coefficients by applying a Hadamard transform to an area comprising said sample and at least one sample surrounding said sample. The computer program comprises code means which, when run on a computer, causes the computer to obtain, based on qp, a filtered transform coefficient from a transform coefficient x using a piecewise linear function y with n ≥ 2 pieces. The computer program comprises code means which, when run on a computer, causes the computer to generate transformed samples by applying an inverse Hadamard transform on the filtered transform coefficients. The computer program comprises code means which, when run on a computer, causes the computer to obtain a filtered version of said sample based on at least one of said transformed samples.

[0019] According to a fourth aspect of the embodiments, there is provided a computer program product comprising computer readable means and a computer program according to the third aspect, stored on the computer readable means.

[0020] According to a fifth aspect of the embodiments, there is provided a carrier containing the computer program according to the fourth aspect. The carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.

[0021] According to an embodiment, n = 2 and the piecewise linear function is given as

$$
y_{qp} = \max(k_{qp}\, x + m_{qp},\; 0)
$$

wherein k_qp and m_qp depend on qp.

[0022] Certain embodiments may provide one or more of the following technical advantage(s). One advantage of the proposed solution is a significant reduction of items that need to be stored. For example, when n = 2, only two values need to be stored (k_qp and m_qp), instead of 128. Accordingly, for qp ranging from 18 to 63, using a piecewise linear function consisting of two linear pieces, the number of values that need to be stored is (63 − 17) · 2 = 92. In addition, if every k_qp and m_qp needs 10 bits, the total number of bytes needed is 92 · 10/8 = 115 bytes, which is a reduction by (5888 − 115)/5888 = 98% as compared to [1]. Another advantage of the embodiments disclosed herein is that they allow the use of highly efficient SIMD implementations on CPUs.

[0023] Generally, all terms used herein are to be interpreted according to their ordinary meaning in the relevant technical field, unless a different meaning is clearly given and/or is implied from the context in which it is used. All references to a/an/the element, apparatus, component, means, step, etc. are to be interpreted openly as referring to at least one instance of the element, apparatus, component, means, step, etc., unless explicitly stated otherwise. The steps of any methods disclosed herein do not have to be performed in the exact order disclosed, unless a step is explicitly described as following or preceding another step and/or where it is implicit that a step must follow or precede another step. Any feature of any of the embodiments disclosed herein may be applied to any other embodiment, wherever appropriate. Likewise, any advantage of any of the embodiments may apply to any other embodiments, and vice versa. Other objectives, features, and advantages of the enclosed embodiments will be apparent from the following description.

BRIEF DESCRIPTION OF THE DRAWINGS

[0024] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.

[0025] FIG. 1 illustrates a chart showing LUT values for a quantization parameter value and an approximated piecewise linear function according to one embodiment.

[0026] FIG. 2 illustrates a chart showing LUT values for a quantization parameter value and an approximated piecewise linear function according to one embodiment.

[0027] FIG. 3 illustrates a chart showing LUT values for a quantization parameter value and an approximated piecewise linear function according to one embodiment.

[0028] FIG. 4 illustrates a chart showing an approximated piecewise linear function according to one embodiment.

[0029] FIG. 5 is a flow chart illustrating a process according to one embodiment.

[0030] FIG. 6 is a diagram showing functional units of a node according to one embodiment.

[0031] FIG. 7 is a block diagram of a node according to one embodiment.

DETAILED DESCRIPTION

[0032] Throughout the current disclosure, the filtering of intensity values is described as an example. The filtering of intensity values normally refers to the Y in YCbCr. However, this is not required, and in alternative embodiments the filtering disclosed herein can also be used for chroma values such as Cb and Cr, or for components from other color spaces such as ICtCp, Lab and Y’u’v’, among others.

[0033] The general goal of the current disclosure is to approximate the function:

$$
W(R(i), \sigma) = \frac{R(i)^3}{R(i)^2 + \sigma^2} \qquad \text{(Equation 7a)}
$$

which the LUT tabulates for the first 128 values. Equation 7a can also be written as: (Equation 7b)

Equation 7a and Equation 7b are used interchangeably throughout the current disclosure.

While the LUT only tabulates up to 128 values, the function W(R(i), σ) can be calculated for arbitrary values of R(i). For example, the function W(R(i), σ) can be calculated over the possible range for R(i), which lies within [-4092, 4092] when dealing with 10-bit data that has been transformed using the Hadamard transform. Namely, based on Equation 0 provided above, a person of ordinary skill in the art will realize that if the intensities are in the range [0, 1023], then no coefficient can be larger than 4092 or smaller than −4092.
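The coefficient range can be checked by brute force. Each coefficient in Equation 0 is linear in the four samples, so its extremes occur when every sample is at 0 or 1023, and it suffices to enumerate those 16 corner cases. The helper name `hadamard4` is hypothetical.

```python
from itertools import product


def hadamard4(i0, i1, i2, i3):
    """Forward 4-point Hadamard transform, following Equation 0."""
    y0, y1 = i0 + i2, i1 + i3
    y2, y3 = i0 - i2, i1 - i3
    return (y0 + y1, y0 - y1, y2 + y3, y2 - y3)


LOW, HIGH = 0, 1023                      # 10-bit sample range

corner_outputs = [hadamard4(*samples) for samples in product((LOW, HIGH), repeat=4)]
largest = max(max(r) for r in corner_outputs)
smallest = min(min(r) for r in corner_outputs)
```

Under Equation 0 as written, the sum coefficient reaches 4092 while the difference coefficients stay within ±2046, so the interval [-4092, 4092] is a safe bound for all four coefficients.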

[0034] In some embodiments, for values of R(i) between 0 and 127, W(R(i), σ) is equal to LUT(R(i), σ), and there is no distinction between approximating W(R(i), σ) and approximating LUT(R(i), σ).

[0035] According to one aspect, a method for filtering of a sample is provided, as shown in FIG. 5. The method comprises a step S1 of obtaining a quantization parameter qp associated with said sample. As already mentioned, filtering of the samples may be performed for qp values ranging from 18 to 63. The method further comprises a step S2 of generating transform coefficients by applying a Hadamard transform to an area comprising said sample and at least one sample surrounding said sample. An example of such an area is given in paragraph [003], where the sample to be filtered is i₀ and where the surrounding samples are i₁, i₂ and i₃. Equation 0 shows how four transform coefficients R₀ to R₃ are generated by applying a Hadamard transform to an area of four samples i₀ to i₃.

[0036] The method further comprises a step S3 of obtaining, based on qp, a filtered transform coefficient from a transform coefficient x using a piecewise linear function y with n ≥ 2 pieces. Obtaining a filtered transform coefficient from a transform coefficient x using a piecewise linear function y is equivalent to approximating a function W(R(i), σ) or, equivalently, W(x, σ), which is also often denoted the W-function throughout the description. Thus, using a piecewise linear function y with n ≥ 2 pieces on a transform coefficient x is intended to approximate the function W(x). Therefore, the terms “approximating the function W(x)” and “obtaining a filtered transform coefficient from a transform coefficient x using a piecewise linear function y” are used interchangeably throughout the rest of the application.

[0037] The method comprises a step S4 of generating transformed samples by applying an inverse Hadamard transform on the filtered transform coefficients. The method further comprises a step S5 of obtaining a filtered version of said sample based on at least one of said transformed samples. For example, a filtered version of said sample may be the corresponding transformed sample itself or it may be a combination of transformed samples surrounding said transformed sample.

[0038] According to some embodiments, the obtaining (S3) may be applied to the transform coefficients having an absolute value smaller than a threshold THR, wherein at least one piece of the piecewise linear function y has a slope different from zero. The value of THR may be a power of 2, for example 128, as will be described below. The value of the threshold may be as high as the maximum value of a transform coefficient, for example 4092 when ten bits are used to represent the samples. This basically means that filtered transform coefficients are obtained in step S3 for all the transform coefficients, regardless of their value.

[0039] The piecewise linear function y may be either continuous or non-continuous, as will be described below.

[0040] The current disclosure describes seven embodiments of a method for filtering a sample value. The first two embodiments use a continuous piecewise linear function with n = 2 pieces to obtain filtered transform coefficients or, equivalently, to approximate the W-function. The next two embodiments use a piecewise linear function with n pieces to obtain filtered transform coefficients, where the piecewise linear function may be non-continuous. The following two embodiments also use piecewise linear functions with n pieces, where the pieces are connected at threshold points. The seventh embodiment is an efficient way to ensure that the piecewise linear function is continuous without spending excessive bits in the calculation.

[0041] In a first embodiment, a two-piece piecewise linear function is used to approximate the W-function, i.e., to obtain filtered transform coefficients, for transform coefficient values between 0 and THR − 1 = 127 (i.e., THR = 128). For transform coefficient values larger than 127, filtering is not applied. Accordingly, using the two-piece linear function in this embodiment amounts to approximating the LUT. The LUT given in [1] can be expressed by LUT(x, qp), which depends on qp and the transform coefficient x. In the context of the current disclosure, the first variable for LUT, i.e., x, denotes the column and the second variable for LUT, i.e., qp, denotes the row, which is opposite to how it is indicated in, for instance, MATLAB. In some embodiments, the input x may correspond to luma or chroma coefficients.

[0042] A two-piece piecewise linear function with nonnegative values may be provided by y_qp = max(k_qp·x + m_qp, 0). This function is an example of a continuous function. Given the LUT, it is possible to find a linear function that fits the LUT for each qp value. To illustrate this, FIG. 1 depicts an example for LUT(x, 37) with qp = 37 and an approximated linear function using parameters k₃₇ = 1.0194 and m₃₇ = −24.2107 with y_qp = max(k_qp·x + m_qp, 0).
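As a small worked example, the two-piece function for qp = 37 can be evaluated with the parameters quoted above. The function name `two_piece` is hypothetical; the constants are the k₃₇ and m₃₇ values from the text.

```python
K37 = 1.0194       # slope for qp = 37, from the text
M37 = -24.2107     # offset for qp = 37, from the text


def two_piece(x, k=K37, m=M37):
    """Two-piece approximation: zero up to the kink, then a straight line."""
    return max(k * x + m, 0.0)


samples = {x: two_piece(x) for x in (0, 23, 24, 64, 127)}
```

The kink sits just below x = 24, and at x = 127 the function gives about 105.25, compared with the tabulated value 108 in LUT37.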

[0043] In the first embodiment, the values of k_qp and m_qp for the different qp values from 18 to 63 may be equal to:

k_qp = [1.0326, 1.0354, 1.0352, 1.0418, 1.0469, 1.0509, 1.0535, 1.0595, 1.06, 1.0666, 1.0678, 1.0703, 1.0712, 1.0728, 1.0682, 1.0658, 1.0585, 1.0471, 1.036, 1.0194, 1.0002, 0.97052, 0.94113, 0.90866, 0.87668, 0.83445, 0.788, 0.73877, 0.68586, 0.64118, 0.58756, 0.5387, 0.4912, 0.44032, 0.3957, 0.35398, 0.31849, 0.28188, 0.24798, 0.21923, 0.19435, 0.16846, 0.14658, 0.12954, 0.1131, 0.09927], and

m_qp = [-4.5179, -5.0774, -5.4391, -6.2224, -6.9851, -7.7923, -8.6039, -9.6161, -10.4341, -11.724, -12.7417, -14.0196, -15.2738, -16.7861, -17.8652, -19.3854, -20.6147, -21.7579, -23.1179, -24.2107, -25.3656, -25.6423, -26.2748, -26.7235, -27.5008, -27.282, -26.838, -26.1413, -25.1188, -24.609, -23.2322, -22.1601, -20.9494, -19.2451, -17.8859, -16.5127, -15.5019, -14.0914, -12.7722, -11.6584, -10.7233, -9.4997, -8.536, -7.8398, -7.0491, -6.4188], respectively.

[0044] In some embodiments, the k_qp and m_qp values have been obtained by minimizing the mean squared error between the nonzero elements in the LUT and the linear function. In some embodiments, the k_qp and m_qp values may be stored with a fixed-point representation. As an example, with nine bits of fractional resolution, k would be stored in steps of 1/512. Accordingly, 10 bits would be sufficient to cover the entire range, since the maximum number stored, 1023, would represent 1023/512 = 1.9980, which is larger than all the k values in the list provided above. Likewise, if five bits of fractional resolution are sufficient for m, then the m values may be stored in steps of 1/32. Without counting the sign bit, 10 bits would then be sufficient to store −m, since the largest value would be 27.5008, which is smaller than the largest representable value 1023/32 = 31.96875.
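The fixed-point storage described above can be sketched as follows for qp = 37. This is one possible integer evaluation, not the one prescribed by the text: rounding to nearest when quantizing, aligning the two scales by a left shift, and truncating the result are all assumptions, and the function names are hypothetical.

```python
K_FRAC_BITS = 9    # k stored in steps of 1/512
M_FRAC_BITS = 5    # -m stored in steps of 1/32


def to_fixed(value, frac_bits):
    """Quantize a non-negative real value (rounding to nearest is an assumption)."""
    return round(value * (1 << frac_bits))


def two_piece_fixed(x, k_fx, neg_m_fx):
    """Integer evaluation of max(k*x + m, 0) with m stored as -m."""
    # Bring -m to the same 9-bit fractional scale as k*x before subtracting.
    acc = k_fx * x - (neg_m_fx << (K_FRAC_BITS - M_FRAC_BITS))
    return max(acc, 0) >> K_FRAC_BITS


k_fx = to_fixed(1.0194, K_FRAC_BITS)         # k for qp = 37
neg_m_fx = to_fixed(24.2107, M_FRAC_BITS)    # -m for qp = 37
```

Both stored integers fit in 10 bits, and the integer result at x = 127 agrees with the truncated floating-point value of about 105.25.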

[0045] While the THR is set as 128 in the above description of the first embodiment, this is not required; the THR value may be set to different values in different embodiments. If a better result can be obtained at THR = 256 or THR = 53, the THR value may be set accordingly. There is a tradeoff, however, as a higher THR will reduce the problem with discontinuity, but at the same time the approximated function W will not be as accurate for values lower than the higher THR.

[0046] Another aspect of the current disclosure is that the THR does not need to be the same for every qp value. Namely, it may be advantageous to have different values of THR for different qp values according to some embodiments. For example, for low qp values, such as qp = 18, σ is very small and hence the function W(x) is close to a straight line, W(x) ≈ x, especially for large values of x. Hence for qp = 18 it may make sense to use a small value of THR, such as THR = 32. However, for high qp values, such as qp = 63, the difference between W(x) and x will be big unless x is large. In this case it may make sense to use a larger THR, such as THR = 2048.

[0047] In a second embodiment, a two-piece piecewise linear function y_qp = max(k_qp·x + m_qp, 0) is used for all values of x, not only for the values up to THR as described above in the first embodiment. This has the advantage that there is no discontinuity at x = THR (i.e., x = 128), so problems that arise due to discontinuity are avoided. This is equivalent to setting THR = 4093 in the previous embodiment.

[0048] In a third embodiment, there is provided a variant of embodiment 1, where a piecewise linear function with n pieces is used for x up to THR − 1, such as THR − 1 = 127. Since x is between 0 and 127, the range of input x is divided into n bins with each bin covering a range of 128/n. A linear function is used to approximate the LUT within each bin. The number of bins may be small enough that efficient SIMD operations can be used. As an example, if the SIMD architecture allows for look-up from a 128-bit register, and the linear parameters k and m are stored using 8 bits each, it may be good to use n = 16, i.e., using 16 different pieces in the piecewise linear function. This way one SIMD operation can be used to obtain k and another SIMD operation can be used to obtain m. Alternatively, n = 8 may be used and both k and m values may be obtained in a single SIMD operation. In some embodiments, the number of bins can be chosen as needed. In some embodiments, the range of x does not have to be limited to [0, 127]. The third embodiment can be applied to x with a larger range, e.g., [0, 1023], if needed. For the purpose of explanation and for the sake of simplicity, x is assumed to be between 0 and 127 in the following description.

[0049] The thresholds of the n bins may be denoted as t₀, t₁, t₂, ..., t_n. Accordingly, [t₀, t₁, t₂, ..., t_n] = [0, 128/n, 2·128/n, ..., n·128/n]. For a given qp value, a linear function is used to approximate the LUT within each bin, as shown below in Equation 9:

$$
y = \max(k_{qp,bin_i}\, x + m_{qp,bin_i},\; 0), \quad t_i \le x < t_{i+1} \qquad \text{(Equation 9)}
$$

Thus, for each bin, a pair of k_{qp,bin_i} and m_{qp,bin_i} values is needed. In some embodiments, a value of x in a given bin, e.g., x ∈ [t₁, t₂], may never cause k_{qp,bin₁}·x + m_{qp,bin₁} to be negative. In such embodiments, the max-operation may be omitted, thereby saving computational resources. With the LUT(qp, x) in [1], we can compute that

$$
k_{qp,bin_i} = \frac{LUT(qp,\, t_{i+1} - 1) - LUT(qp,\, t_i)}{t_{i+1} - 1 - t_i} \quad \text{for } i = 0, \ldots, n - 1. \qquad \text{(Equation 10)}
$$

Accordingly,

$$
m_{qp,bin_i} = LUT(qp,\, t_i) - k_{qp,bin_i}\, t_i \qquad \text{(Equation 11)}
$$

[0050] The values of k and m can be stored in LUT_k_qp and LUT_m_qp, respectively, and fetched using an index obtained by shifting x a number of bits to the right, as shown below in Equations 12 and 13:

(Equation 12)

(Equation 13)

[0051] For each qp value, 2n values need to be stored. For qp values ranging from 18 to 63, (63 − 17) · 2n = 92n values need to be stored. If n = 8, this gives 736 values, i.e., 736 bytes at one byte per value, which is significantly smaller than 5888 bytes.
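The storage counts for the piecewise linear approach can be reproduced with a short helper. `table_size` is a hypothetical name, and the bit widths passed in (10 bits for the two-piece case, 8 bits for n = 8) are the ones quoted in the text.

```python
def table_size(n_pieces, bits_per_value, qp_min=18, qp_max=63):
    """Return (number of stored values, bytes) for k and m over all qp values."""
    n_qp = qp_max - qp_min + 1
    values = n_qp * 2 * n_pieces                    # one k and one m per piece
    total_bytes = (values * bits_per_value + 7) // 8  # round up to whole bytes
    return values, total_bytes


two_piece_values, two_piece_bytes = table_size(n_pieces=1, bits_per_value=10)
eight_piece_values, eight_piece_bytes = table_size(n_pieces=8, bits_per_value=8)
reduction = (5888 - two_piece_bytes) / 5888         # versus one byte per LUT entry
```

Here `n_pieces=1` corresponds to the two-piece function of the first embodiment, which stores a single (k, m) pair per qp because its first piece is the constant zero.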

[0052] Taking n = 8 and qp = 37 as an example, the values of LUT_k_qp and LUT_m_qp may be LUT_k_qp = [0.066667, 0.46667, 0.86667, 1, 1.1333, 1.1333, 1.1333, 1.0667] and LUT_m_qp = [0, -6.4667, -19.7333, -26, -34.5333, -34.6667, -34.8, -27.4667]. The approximation is depicted in FIG. 2. In some embodiments, the approximated piecewise linear function shown in FIG. 2 may have 8 (possibly disconnected) linear pieces.
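The k and m values listed above can be reproduced from the LUT37 row by drawing, within each bin, a line through the first and last tabulated entries of that bin. The sketch below assumes that endpoint rule; `fit_disconnected` is a hypothetical name and LUT37 is copied from paragraph [009].

```python
LUT37 = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
    1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8,
    8, 9, 10, 10, 11, 12, 13, 14, 14, 15, 16, 17, 18, 19, 20, 21,
    22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
    38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 55,
    56, 57, 58, 59, 60, 61, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73,
    74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 91,
    92, 93, 94, 95, 96, 97, 98, 99, 101, 102, 103, 104, 105, 106, 107, 108,
]


def fit_disconnected(lut, n):
    """Per-bin line through the first and last LUT entry of each bin."""
    width = len(lut) // n
    ks, ms = [], []
    for i in range(n):
        t_lo = i * width                 # first x in the bin
        t_hi = t_lo + width - 1          # last x in the bin
        k = (lut[t_hi] - lut[t_lo]) / (t_hi - t_lo)
        ks.append(k)
        ms.append(lut[t_lo] - k * t_lo)  # line passes through (t_lo, lut[t_lo])
    return ks, ms


ks, ms = fit_disconnected(LUT37, 8)
```

Since neighbouring bins use different end points, the line of one bin does not in general meet the line of the next bin at the threshold, which is why the pieces may be disconnected.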

[0053] A fourth embodiment is similar to the third embodiment, in that an n-piece piecewise linear function is used. However, in the fourth embodiment, the approximation is applied for the entire range of x. As described above, this range of x can be [-4092, 4092] for 10-bit values that have been transformed with the Hadamard transform. In some embodiments, the range of x may be [0, 4092] if W(x) is symmetric.

[0054] As noted above, the piecewise linear function of the third embodiment may be discontinuous at the thresholds. Accordingly, a solution is provided as a fifth embodiment in which the piecewise linear function is continuous. In Equation 9, the value of y may differ just below and just above a threshold, i.e., for x < t_i and x > t_i. In the fifth embodiment, the equation is modified so that, for i = 0, ..., n − 2, y is provided as:

$$
y = LUT(qp,\, t_i) + \frac{LUT(qp,\, t_{i+1}) - LUT(qp,\, t_i)}{t_{i+1} - t_i}\,(x - t_i) \qquad \text{(Equation 15)}
$$

and the k and m values may be expressed as

$$
k_{qp,bin_i} = \frac{LUT(qp,\, t_{i+1}) - LUT(qp,\, t_i)}{t_{i+1} - t_i}, \qquad m_{qp,bin_i} = LUT(qp,\, t_i) - k_{qp,bin_i}\, t_i \qquad \text{(Equation 16)}
$$

For i = n − 1, y is provided by Equation 9, and k_{qp,bin_i} and m_{qp,bin_i} are provided by Equation 10 and Equation 11, respectively.

[0055] Similarly, we store the values of k and m in LUT_k_qp and LUT_m_qp, respectively. Equation 12 and Equation 13 may be used to retrieve the LUT values.

[0056] Taking again n = 8 and qp = 37 as an example, the values of LUT_k_qp and LUT_m_qp may be LUT_k_qp = [0.0625, 0.4375, 0.875, 1, 1.125, 1.125, 1.125, 1.0667] and LUT_m_qp = [0, -6, -20, -26, -34, -34, -34, -27.4667]. The approximation is depicted in FIG. 3. In some embodiments, the approximated piecewise linear function has 8 connected linear pieces.
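The connected variant can be reproduced from LUT37 in the same way, except that every bin but the last ends on the first entry of the next bin, so neighbouring lines share a knot. The sketch below assumes that rule, with the last bin falling back to its own last tabulated entry; `fit_connected` is a hypothetical name and LUT37 is copied from paragraph [009].

```python
LUT37 = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
    1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8,
    8, 9, 10, 10, 11, 12, 13, 14, 14, 15, 16, 17, 18, 19, 20, 21,
    22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
    38, 39, 40, 41, 42, 44, 45, 46, 47, 48, 49, 50, 51, 52, 54, 55,
    56, 57, 58, 59, 60, 61, 63, 64, 65, 66, 67, 68, 69, 70, 72, 73,
    74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 91,
    92, 93, 94, 95, 96, 97, 98, 99, 101, 102, 103, 104, 105, 106, 107, 108,
]


def fit_connected(lut, n):
    """Per-bin lines that meet at the thresholds t_1 .. t_(n-1)."""
    width = len(lut) // n
    ks, ms = [], []
    for i in range(n):
        t_lo = i * width
        if i < n - 1:
            t_hi = t_lo + width          # shared knot with the next bin
        else:
            t_hi = t_lo + width - 1      # last bin: no entry at x = 128
        k = (lut[t_hi] - lut[t_lo]) / (t_hi - t_lo)
        ks.append(k)
        ms.append(lut[t_lo] - k * t_lo)
    return ks, ms


ks, ms = fit_connected(LUT37, 8)
```

For the first seven bins the divisor is the bin width 16, so the slopes are exact multiples of 1/16 and the offsets are integers, which matches the rounder numbers in the list above.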

[0057] A sixth embodiment is similar to the fifth embodiment. In the sixth embodiment, the piecewise linear function with n pieces is used for the entire range of x. As described above, this range of x can be [-4092, 4092] for 10-bit values that have been transformed with the Hadamard transform.

[0058] In a seventh embodiment, there may be some cases where it may be advantageous not to calculate the piecewise linear approximation according to y = kx + m, since it may require high resolution in fixed-point implementations. As an example, let us assume that the curve W(x) shown in FIG. 4 needs to be approximated using piecewise linear approximation. FIG. 4 illustrates approximating the curve W(x) (shown in solid lines) using piecewise linear line segments (shown in dotted lines).

[0059] Assume, for simplicity, that the difference between x_k and x_{k+1} is a constant Δ, i.e., x_{k+1} − x_k = Δ. In some embodiments, this constraint may be relaxed, thereby allowing for denser line segments near zero, but for the purpose of explanation, the difference is assumed to be constant.

[0060] The function W(x) has been tabulated at the endpoints of the line segments. Hence for x = x₀, x₁, x₂, ..., x_k, ... the function values are known as W(x₀), W(x₁), W(x₂), ..., W(x_k), etc.

[0061] Let us now assume that the approximate value W(x) needs to be calculated at a general position x. First, the index k is calculated for the largest value x_k that does not exceed x:

$$
k = x \;\mathrm{div}\; \Delta \qquad \text{(Equation 17)}
$$

where div performs division and rounding down. Also, x_k = k · Δ.

[0062] In practice, it is advantageous to use a Δ that is a power of two, for example, Δ = 16, since division can then simply be replaced with a right shift. In the context of the current disclosure, Δ = 16 is used as an example, but this is not required and different values of Δ may be used for different qp values in alternative embodiments. In this instance, the calculation simplifies to

$$
k = x \gg 4 \qquad \text{(Equation 18)}
$$

$$
x_k = k \ll 4 \qquad \text{(Equation 19)}
$$

where ≫ denotes rightwards shift and ≪ denotes leftwards shift. The values W_k = W(x_k) and W_{k+1} = W(x_{k+1}) may now be obtained by indexing a look-up table with k:

$$
W_k = LUT_w(k), \qquad W_{k+1} = LUT_w(k + 1) \qquad \text{(Equation 20)}
$$

The difference between x and x_k may also be calculated as:

$$
\Delta x = x - x_k \qquad \text{(Equation 21)}
$$

[0063] The approximate value W(x) may now be calculated as the value W_k plus Δx steps along the slope (W_{k+1} − W_k)/Δ:

$$
W(x) = W_k + \Delta x \cdot \frac{W_{k+1} - W_k}{\Delta} \qquad \text{(Equation 22a)}
$$

Equation 22a may also be rewritten as:

$$
W(x) = W_k + \frac{\Delta x}{\Delta}\,(W_{k+1} - W_k) \qquad \text{(Equation 22b)}
$$

[0064] For sufficient precision, Δx/Δ may be represented using steps of 1/16. This requires four fractional bits. This is multiplied by a difference W_{k+1} − W_k which in theory may be very large, but since W(x) is close to x this difference is always positive and should never be bigger than two times Δ, in this case 32. This means that only five bits are needed to represent the difference.

[0065] In practice the following operations may take place:

$$
\mathit{diff} = \big((W_{k+1} - W_k) \cdot \Delta x\big) \gg 4 \qquad \text{(Equation 23)}
$$

$$
W(x) = W_k + \mathit{diff} \qquad \text{(Equation 24)}
$$

Accordingly, there may be no need for any big multiplications. Furthermore, since W_{k+1} − W_k is small, it may be advantageous to store it in a separate LUT for parallel fetching with W_k.

[0066] In summary, the entire calculation may be as the following:

k = x » 4

x_k = k « 4

Δx = x − x_k

W_k = LUT_w(k)

W_{k+1} − W_k = LUT_other(k)

diff = (Δx * (W_{k+1} − W_k)) » 4

W(x) = W_k + diff

[0067] This seventh embodiment may be viewed as equivalent with the other embodiments described above because Equation 22a may be provided as:

W(x) = W_k + (x − x_k) * (W_{k+1} − W_k)/D

and then further rewritten as:

W(x) = ((W_{k+1} − W_k)/D) * x + W_k − ((W_{k+1} − W_k)/D) * x_k

where (W_{k+1} − W_k)/D may be identified as the slope k value and W_k − ((W_{k+1} − W_k)/D) * x_k could be identified as the m value. However, calculating (W_{k+1} − W_k)/D with enough precision may require 5 integer bits (to hold 32) and four fractional bits (to represent steps of 1/16), or 9 bits in total. This would be multiplied by x which would be a 12 bit number. Hence a 9 times 12 bit multiplication would be required, which is much more than the 5 times 4 bit multiplication described in the other embodiments above. Hence, the seventh embodiment is less costly in this regard. This seventh embodiment may be used to approximate the LUT up to THR or it can be used to approximate the entire function W( ).

[0068] One advantage of the proposed solution is a significant reduction of items that need to be stored. For example, when n = 2, only two values need to be stored (k_qp and m_qp), instead of 128. Accordingly, for qp ranging from 18 to 63, using a piecewise linear function consisting of two linear pieces, the number of values that need to be stored is (63-17)*2 = 92. In addition, if k_qp and m_qp each need 10 bits, the total number of bytes needed is 92*10/8 = 115 bytes, which is a reduction by (5888-115)/5888 = 98% as compared to [1].
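The storage arithmetic of the preceding paragraph can be checked with a few lines of Python. The sketch assumes that the figure 5888 corresponds to 46 qp values times 128 table entries of one byte each; that breakdown is an inference from the numbers given, and the variable names are illustrative.

```python
# Reproduces the storage figures quoted in the paragraph above.
# ASSUMPTION: 5888 bytes = 46 qp values * 128 LUT entries * 1 byte per entry.

QP_MIN, QP_MAX = 18, 63
num_qp = QP_MAX - QP_MIN + 1              # 46 qp values, i.e. 63 - 17

values_pwl = num_qp * 2                   # one k_qp and one m_qp per qp
bytes_pwl = values_pwl * 10 / 8           # 10 bits per stored value

lut_entries_per_qp = 128
bytes_lut = num_qp * lut_entries_per_qp   # assumed one byte per entry

reduction = (bytes_lut - bytes_pwl) / bytes_lut
print(values_pwl, bytes_pwl, bytes_lut, f"{reduction:.1%}")
```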

[0069] Another advantage of the embodiments disclosed herein is that they allow the use of highly efficient SIMD implementations on CPUs. SIMD operations allow the execution of several operations simultaneously on a modern CPU. As an example, if a normal machine code instruction can add two numbers to each other, a SIMD operation can add eight numbers to eight other numbers in parallel. This can improve performance considerably. There are SIMD operations for performing table look-ups. However, such SIMD operations need the entire LUT to be stored in a single SIMD register. Such registers are typically of the size of 128 bits. If 8-bit values are used, this means the largest number of items that such an operation can handle is 128/8 = 16 items. The LUT37 array described as an example above has 128 items, and it would therefore be too big to implement using a single SIMD operation on current hardware. In contrast, it is easy to execute arithmetic operations used in Equation 2, for example, using SIMD instructions.

[0070] More specifically, while k₃₇ and m₃₇ would be obtained from a LUT, this happens once per block and is therefore not in the inner loop where SIMD optimization is necessary. The inner loop instead contains several executions of Equation 6, each execution including a multiplication and addition followed by a max operation. On many modern CPUs it is possible to perform multiplication followed by addition in parallel, meaning that, say, eight parallel computations of Equation 6 can be carried out in just two instructions: one for the multiply and add and one for the max operation.
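The two-step structure of the inner loop can be illustrated lane by lane in plain Python, with a list of eight values standing in for a SIMD register. The sketch is hedged in several ways: Equation 6 is not reproduced in this section, so the form max(0, k * x + m) is an assumption, and the fixed-point scaling by 2^6, the parameter values and all function names are hypothetical.

```python
# Lane-wise sketch of the inner loop: one multiply-add step, one max step.
# ASSUMPTIONS: the form max(0, k*x + m), the 6-bit fixed-point scaling and
# the values of k37 and m37 are hypothetical; Python lists stand in for
# SIMD registers holding eight lanes.

FRAC_BITS = 6


def multiply_add(lanes, k_fixed, m_fixed):
    """Step 1: k*x + m on every lane (a single multiply-add in SIMD)."""
    return [(k_fixed * x + m_fixed) >> FRAC_BITS for x in lanes]


def lane_max(lanes, floor=0):
    """Step 2: element-wise max against a constant."""
    return [max(v, floor) for v in lanes]


def filter_lanes(lanes, k_fixed, m_fixed):
    return lane_max(multiply_add(lanes, k_fixed, m_fixed))


# Fetched once per block, outside the inner loop (hypothetical values:
# slope 1.25 and intercept -10 in 6-bit fixed point).
k37 = 80
m37 = -640

coeffs = [0, 4, 8, 12, 16, 20, 30, 40]    # eight transform coefficients
print(filter_lanes(coeffs, k37, m37))
```

A real implementation would replace the two list comprehensions with one vector multiply-add instruction and one vector max instruction.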

[0071] FIG. 6 is a diagram showing functional units of a node 602 for filtering of a sample according to one embodiment. Node 602 may for example be an encoder. Alternatively, node 602 may be a decoder.

[0072] Node 602 includes an obtaining unit 604 for obtaining a quantization parameter qp associated with said sample. Node 602 includes a generating unit 606 for generating transform coefficients by applying a Hadamard transform to an area comprising said sample and at least one sample surrounding said sample. Node 602 further includes an obtaining unit 608 for obtaining, based on qp, a filtered transform coefficient from a transform coefficient x using a piecewise linear function y with n ≥ 2 pieces. Node 602 further includes a generating unit 610 for generating transformed samples by applying an inverse Hadamard transform on the filtered transform coefficients. Node 602 includes an obtaining unit 612 for obtaining a filtered version of said sample based on at least one of said transformed samples.

[0073] FIG. 7 is a block diagram of a node 602 for filtering of a sample according to one embodiment. Node 602 may for example be an encoder. Alternatively, node 602 may be a decoder.
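The chain formed by units 604 to 612 of node 602 can be sketched as plain Python functions. This is an illustration only: it assumes an area of four samples, an unnormalised 4-point Hadamard transform that is undone by applying the same transform and dividing by 4, a caller-supplied coefficient filter, and the first transformed sample as output. The function names are invented for the example.

```python
# Sketch of the processing chain of units 604-612.
# ASSUMPTIONS: four samples per area, unnormalised 4-point Hadamard
# transform (applying it twice scales by 4), caller-supplied filter,
# output taken from the first transformed sample.

def hadamard4(v):
    """4-point Hadamard transform; hadamard4(hadamard4(v)) == 4 * v."""
    a, b, c, d = v
    return [a + b + c + d,
            a - b + c - d,
            a + b - c - d,
            a - b - c + d]


def filter_sample(samples, qp, coeff_filter):
    """samples[0] is the sample to filter; samples[1:] are its neighbours."""
    coeffs = hadamard4(samples)                           # unit 606
    filtered = [coeff_filter(c, qp) for c in coeffs]      # unit 608
    transformed = [t // 4 for t in hadamard4(filtered)]   # unit 610
    return transformed[0]                                 # unit 612


def identity(c, qp):
    """Pass-through filter, used only to show that the chain inverts."""
    return c


area = [10, 12, 9, 11]
qp = 37                                                   # unit 604
print(filter_sample(area, qp, identity))
```

With the pass-through filter the chain returns the input sample unchanged, which is a convenient sanity check before a piecewise linear filter is plugged in as `coeff_filter`.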

[0074] As shown in FIG. 7, node 602 may comprise: processing circuitry (PC) 702, which may include one or more processors (P) 755 (e.g., a general purpose microprocessor and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); a network interface 748 comprising a transmitter (Tx) 745 and a receiver (Rx) 747 for enabling node 602 to transmit data to and receive data from other nodes connected to a network 710 (e.g., an Internet Protocol (IP) network) to which network interface 748 is connected; and a local storage unit (a.k.a.,“data storage system”) 708, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 702 includes a programmable processor, a computer program product (CPP) 741 may be provided. CPP 741 includes a computer readable medium (CRM) 742 storing a computer program (CP) 743 comprising computer readable instructions (CRI) 744. CRM 742 may be a non-transitory computer readable medium, such as, magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 744 of computer program 743 is configured such that when executed by PC 702, the CRI causes node 602 to perform steps and the embodiments described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, node 602 may be configured to perform steps described herein without the need for code. That is, for example, PC 702 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.

[0075] While various embodiments of the present disclosure are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

[0076] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.

REFERENCES

[1] V. Stepin et al.: “CE2 related: Hadamard Transform Domain Filter”, Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, July 2018, document JVET-K0068-v3.

[2] J. Strom et al.: “CE2 related: Reduced complexity bilateral filter”, Joint Video Experts Team (JVET) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11, July 2018, document JVET-K0274-v4.

Claims

1. A method for filtering of a sample, the method comprising:

obtaining (S1) a quantization parameter qp associated with said sample;

generating (S2) transform coefficients by applying a Hadamard transform to an area comprising said sample and at least one sample surrounding said sample;

obtaining (S3), based on qp, a filtered transform coefficient from a transform coefficient x using a piecewise linear function y with n ≥ 2 pieces;

generating (S4) transformed samples by applying an inverse Hadamard transform on the filtered transform coefficients; and

obtaining (S5) a filtered version of said sample based on at least one of said transformed samples.

2. The method of claim 1, wherein the obtaining (S3) is applied on the transform coefficients having an absolute value smaller than a threshold THR and wherein at least one piece of the piecewise linear function y has a slope different than zero.

3. The method of any of claims 1-2, wherein the piecewise linear function y is continuous.

4. The method of claim 3, wherein n = 2 and the piecewise linear function is given as

wherein k_qp and m_qp depend on qp.

5. The method of claim 4, wherein the obtaining (S3) is applied on the transform coefficients in the entire range of their values.

6. The method of claim 4, wherein the threshold THR = 128.

7. The method of any of claims 1-6, wherein the method is performed by an encoder.

8. The method of any of claims 1-7, wherein the method is performed by a decoder.

9. A node (602) for filtering of a sample, the node configured to:

obtain a quantization parameter qp associated with said sample;

generate transform coefficients by applying a Hadamard transform to an area comprising said sample and at least one sample surrounding said sample;

obtain, based on qp, a filtered transform coefficient from a transform coefficient x using a piecewise linear function y with n ≥ 2 pieces;

generate transformed samples by applying an inverse Hadamard transform on the filtered transform coefficients; and

obtain a filtered version of said sample based on at least one of said transformed samples.

10. The node (602) of claim 9, wherein the node is configured to obtain filtered transform coefficients from the transform coefficients having an absolute value smaller than a threshold THR and wherein at least one piece of the piecewise linear function y has a slope different than zero.

11. The node (602) of any of claims 9-10, wherein the piecewise linear function y is continuous.

12. The node (602) of claim 11, wherein n = 2 and the piecewise linear function is given as

wherein k_qp and m_qp depend on qp.

13. The node (602) of claim 12, wherein the node is configured to obtain filtered transform coefficients in the entire range of the values of the transform coefficients.

14. The node (602) of claim 12, wherein the threshold THR = 128.

15. The node (602) of any of claims 9-14, wherein the node is an encoder.

16. The node (602) of any of claims 9-14, wherein the node is a decoder.

17. A computer program comprising instructions which when executed by processing circuitry of a node causes the node to perform the method of any one of embodiments 9-14.

18. A carrier containing the computer program of claim 17, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.