WO2020007748A1 - Bilateral filter with LUT avoiding unnecessary multiplication and minimizing the LUT

Info

Publication number
WO2020007748A1
Authority
WO
WIPO (PCT)
Prior art keywords
intensity
sample
lut
difference
value
Prior art date
Application number
PCT/EP2019/067429
Other languages
French (fr)
Inventor
Jacob STRÖM
Per Wennersten
Jack ENHORN
Du LIU
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Application filed by Telefonaktiebolaget LM Ericsson (Publ)
Publication of WO2020007748A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117 Filters, e.g. for pre-processing or post-processing
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136 Incoming video signal characteristics or properties
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • H04N19/182 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel
    • H04N19/80 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/85 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/86 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing

Definitions

  • Bilateral filtering of image data directly after forming the reconstructed image block can be beneficial for video compression.
  • in “Bilateral Filtering for Video Coding” (referred to as [1] hereafter, and incorporated herein in its entirety)
  • bilateral filtering involves a division, which can be expensive for hardware implementations. Therefore Wennersten et al.
  • the filter weights in a bilateral filter depend on the image data, so they need to be calculated on-the-fly or obtained from a look-up-table (“LUT”).
  • LUT look-up-table
  • 2,202 bytes were needed for this LUT.
  • Another 576 bytes were needed for the division table, yielding 2,778 bytes in total for the solution in [1].
  • the implementation proposed in [2] used a LUT of about 33,000 values.
  • the bilateral filter is costly to implement for some forms of implementations, such as in a fully custom ASIC
  • Embodiments may include one or more of these aspects, including all of the aspects together or any other combination.
  • the filter contribution from each surrounding pixel is calculated as a multiplication of three numbers: distance * range * ΔI.
  • the first of these multiplications is avoided by fetching a pre-multiplied value of distance * range from a three-dimensional look-up table (LUT).
  • This three-dimensional LUT becomes very big, around 33,000 values. Therefore a first aspect is to avoid this pre-multiplication.
  • the two-dimensional LUT is six times smaller. This solution reintroduces a multiplication; in some embodiments, this multiplication can be moved so that it is done only once per filtered pixel, yielding significant savings.
  • a second aspect is to avoid the second multiplication between range and ΔI. This is done by pre-calculating this multiplication in the LUT instead of performing the multiplication in the filtering operation. Fortunately, this removal can be made without increasing the dimensionality of the LUT. This means that we can save four multiplications per filtered pixel. When taking into account the multiplication introduced in the first aspect, we therefore only need one multiplication per filtered pixel in total, whereas the solution in [2] may need up to four.
  • the two-dimensional LUT we end up with depends on two parameters, the quantization parameter (qp) and the delta intensity ΔI.
  • qp quantization parameter
  • ΔI delta intensity
  • this way of reusing tables can be applied to the solution [2] as is, or to the solution [2] as modified by one or two of the two previously mentioned aspects, and it can also be applied to the solution in [1].
  • An alternative use of this third aspect is to make each row smaller. As an example, if the longest row is 235 bytes, the scaling transformation can be used to get this down to 16 bytes or fewer. This makes it possible to implement the LUT using SIMD (single instruction, multiple data) instructions, which typically cannot handle LUTs larger than 16 elements.
  • SIMD single instruction, multiple data
  • the bilateral filter is typically placed inside the intra-prediction loop. This means that when creating the intra prediction for the current block, filtered pixels from the previous block may be used. In the decoder, this means that the previous block may have to be fully reconstructed and filtered using the bilateral filter before we can start to construct the prediction for the current block.
  • This intra prediction can be in the critical path of the decoder. Inserting a filter into this path increases the latency for this critical path. This in turn means that the clock frequency of the chip may need to be lowered, perhaps to a point where all pixels of a frame cannot be decoded in time. Thus it is of utmost importance to be able to do this filtering as quickly as possible.
  • the filter where it is not placed inside the intra-prediction loop. As an example, it can be placed as a loop-filter, for instance right before deblocking, after deblocking, or in parallel with sample adaptive offset filtering or in parallel with the adaptive loop filter stage. However, even in this case it will typically not be sufficient with a single instantiation of the LUT to process all of the pixels in a larger image, such as 4K or 8K resolution. Hence, also in this case several instantiations of the LUT would be necessary.
  • the next block to the right may read any of the right-most pixels in the current block for its prediction. Therefore, all four pixels in the right-most column may need to be filtered as soon as possible.
  • One way of doing this is to parallelize the filtering.
  • the bilateral filter is fully parallelizable.
  • all four pixels in the right-most column can be filtered in parallel without changing the result of the filtering.
  • the filter can be placed as a loop-filter, for instance right before deblocking, after deblocking, or in parallel with sample adaptive offset filtering or in parallel with the adaptive loop filter stage.
  • a method for applying bilateral filtering to a media object comprising a plurality of samples includes, for a current sample C, computing a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C.
  • the filtered sample value I_F is given by the equation
  • I_C is the current sample intensity before filtering
  • σ_d is a spatial strength parameter
  • σ_r is an intensity strength parameter
  • computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.
  • the lookup table is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.
  • ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.
  • the lookup table is used to determine e^(-ΔI²/(2σ_r²)), where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In some embodiments, the lookup table is used to compute an
  • the lookup table is one-dimensional and depends on ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.
  • the lookup table is created based on a fixed value of σ_r (σ_r0) and when computing the filtered sample value I_F for a different value of σ_r (σ_r1), a scaling transform s(σ_r1, ΔI) is applied using a constant c determined from σ_r0 and σ_r1.
  • the filtered sample value I F is approximated using fixed point numbers.
  • I F is approximated by I F which is given by
  • using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations.
  • a size of a row of the lookup table is no more than 128 bits.
  • an encoder for applying bilateral filtering to a media object comprising a plurality of samples includes a computing unit configured to, for a current sample C, compute a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C, wherein the filtered sample value I_F is given by the equation
  • I_C is the current sample intensity before filtering
  • σ_d is a spatial strength parameter
  • σ_r is an intensity strength parameter
  • computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.
  • a decoder for applying bilateral filtering to a media object comprising a plurality of samples.
  • the decoder includes a computing unit configured to, for a current sample C, compute a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C, wherein the filtered sample value I_F is given by the equation
  • σ_d is a spatial strength parameter
  • σ_r is an intensity strength parameter
  • computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions.
  • a carrier containing the computer program of embodiments is provided, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
  • FIG. 1 illustrates a plot for various quality parameter (qp) values according to one embodiment.
  • FIG. 2 illustrates a plot according to one embodiment.
  • FIG. 3 is a flow chart illustrating a process according to one embodiment.
  • FIG. 4 is a flow chart illustrating a process according to one embodiment.
  • FIG. 5 is a diagram showing functional units of a node according to one embodiment.
  • FIG. 6 is a block diagram of a node according to one embodiment.
  • I_F is the filtered pixel intensity and I_C is the center pixel intensity, i.e., the pixel intensity before filtering.
  • the intensity differences ΔI_A, ΔI_B, ΔI_L, and ΔI_R change with every pixel, since they depend on the intensity of the center pixel I_C and the intensity values of the surrounding pixels (I_A, I_B, I_L, and I_R). This means that the denominator in Equation 1 will be different for every pixel.
  • a division is needed, and that can be implemented using a division table as described in [1].
  • Since e^(-ΔI²/(2σ_r²)) is strictly smaller than 1, each term in the denominator will become a bit larger, and the larger denominator in Equation 3 will thus give a weaker filtering (less deviation from the original pixel I_C) than Equation 2 would give for the same parameters σ_d and σ_r.
  • Range k in [2]
  • Equation 5 is further simplified in [2] to
  • upper case W is different from lower case w here; as
  • Equation 6 is written using a summation symbol, which in our notation becomes
  • the (upper case) weight W_σd,σr(ΔI_X) depends on three variables: σ_d, σ_r, and ΔI_X.
  • σ_d can take on six different values
  • σ_r can take on 34 different values
  • ΔI_X up to 1,023 different values.
  • the weight is stored as a look-up table (LUT)
  • up to 6*34*1,023 = 208,692 values need to be stored. While this may not be much of a problem for a CPU-based implementation, it can be a serious problem for an implementation in full custom ASIC, where such a large LUT will translate to a sizable part of the available silicon area.
  • The second limitation with the implementation according to Equation 9 is that it mandates a multiplication inside the summation; the (upper case) weight W_σd,σr(ΔI_X) is multiplied with the intensity difference ΔI_X. Again, this may not be much of an issue for a CPU-based implementation, but for a full custom ASIC type implementation this can be expensive, especially if the filter needs to be instantiated several times to achieve parallelism.
  • By using Equation 11 we can now rewrite the bracketed expression in Equation 10 as a sum of these influence terms. Equation 10 then simplifies to
  • I_F = I_C + d(σ_d) * (μ_σr(ΔI_A) + μ_σr(ΔI_B) + μ_σr(ΔI_L) + μ_σr(ΔI_R)).
  • LUT can depend on two variables, σ_r and ΔI.
  • the second aspect is that we do not store e^(-ΔI²/(2σ_r²))
  • the actual value stored in the look-up table will be the integer
  • the filtered pixel is instead calculated as
  • the pixel being filtered will always have access to all its surrounding pixels at the time of filtering.
  • the filter is inside the intra prediction loop, as is the case for both [1] and [2], this cannot always be the case.
  • the filter is inside the intra prediction loop, as is the case for both [1] and [2].
  • the filtering will have to do with fewer surrounding pixels.
  • Equation 6 in which the last term of Equation 6 has been removed.
  • Equation 12 can be used for pixels that are neither border pixels nor corner pixels, whereas for border pixels we can use
  • the filtered pixel is then calculated as,
  • X can be A, B, L or R. This is equal to
  • Equation 1 can be written as
  • the scaling can be used to set a maximum for the number of elements in a LUT row.
  • Many decoders and encoders are implemented in software, and it is well known that SIMD operations are often used to speed up execution of such software. There are efficient SIMD operations for table look-ups, but they typically restrict the number of entries that can be used in the LUT.
  • a common way to implement LUT in SIMD is to put the entire LUT in a SIMD register, which can be 128 bits.
  • every entry is eight bits, that means that there is only room for 16 entries in the LUT.
  • decoding or encoding a block only one LUT row at a time needs to be used, since a block is restricted to a single qp. However, this still puts a restriction that every LUT row cannot be more than 16 non-zero items.
  • two LUT operations with two different registers can be used to obtain the equivalent of a look-up from 32 elements. However, this costs one valuable register and one extra instruction, making it harder to make the code run quickly.
  • a maximum of 16 entries should be used per row. This can be done by selecting a c-value for every q-value so that the resulting table has at most 16 non-zero values (or, alternatively, 15 non-zero values).
  • mapping from qp-value to a r is not something that is derived from some strict principle, but rather something that seems to work reasonably well. If something works reasonably well for qp 21 it may also work reasonably well for qp 20.
  • weight = lookupTablePtr[min(theMaxPos, abs(deltaI) << c_shift)];
  • LUT*(37, ΔI) = {254, 235, 187, 126, 73, 36, 15, 5, 2, 1, 0, 0, 0, 0, 0};
  • LUT*(38, ΔI) = {254, 237, 192, 135, 82, 43, 20, 8, 3, 1, 0, 0, 0, 0, 0, 0};
  • LUT*(39, ΔI) = {254, 239, 197, 143, 91, 50, 25, 11, 4, 1, 1, 0, 0, 0, 0, 0};
  • LUT*(40, ΔI) = {254, 240, 201, 150, 99, 58, 30, 14, 6, 2, 1, 0, 0, 0, 0, 0};
  • LUT*(41, ΔI) = {254, 241, 205, 156, 107, 65, 36, 18, 8, 3, 1, 1, 0, 0, 0, 0};
  • LUT*(42, ΔI) = {254, 242, 209, 162, 114, 73, 42, 22, 10, 4, 2, 1, 0, 0, 0, 0};
  • LUT*(44, ΔI) = {254, 244, 215, 173, 128, 87, 54, 31, 16, 8, 3, 1, 1, 0, 0, 0};
  • LUT*(45, ΔI) = {254, 245, 217, 178, 134, 93, 60, 35, 19, 10, 5, 2, 1, 0, 0, 0};
  • LUT*(46, ΔI) = {254, 245, 220, 182, 140, 100, 66, 41, 23, 12, 6, 3, 1, 1, 0, 0};
  • LUT*(47, ΔI) = {254, 246, 222, 186, 146, 106, 72, 46, 27, 15, 8, 4, 2, 1, 0, 0};
  • LUT*(50, ΔI) = {254, 247, 227, 197, 160, 124, 90, 61, 40, 24, 14, 8, 4, 2, 1, 1};
  • LUT*(51, ΔI) = {255, 248, 229, 200, 165, 129, 95, 67, 44, 28, 16, 9, 5, 2, 1, 1};
  • LUT(S1, A1) ⁇ 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248,
  • this function is step-wise constant.
  • the result will be a function that is piecewise increasing (since ΔI is increasing but the LUT value is constant), followed by discontinuities when the LUT value goes down.
  • halfjval ⁇ 0, 0, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
  • LUT(qp, ΔI) = LUT*(qp, ΔI >> shift_val(qp)) and the multiplication with ΔI can be, for example,
  • Equation 25 (repeated for convenience) works also here: ⁇ AI ⁇ ( ⁇ & 3 ⁇ 4 ) L ,51 - 17,
  • weight = inv_c_value * lookupTablePtr[min(theMaxPos, (abs(deltaI) * c_value) >> 4)] >> 4;
  • the pseudo code above uses two table rows as base tables (32 and 39), but it is possible to use any number of rows as base table rows, including using four rows (e.g., 50, 46, 42, and 38) as base rows. Again, a software version can be much more efficiently rewritten. For a hardware implementation though this is valuable. This combines all three aspects: reducing the size of the LUT by reducing it to two dimensions, avoiding a multiplication, and finally reusing the rows of the LUT by scaling.
  • I_F is the filtered pixel intensity
  • I_C is the center pixel intensity, i.e., the pixel intensity before filtering.
  • I_B, I_L and I_R are the intensities of the pixels immediately below, left and right of the center pixel respectively, and
  • the weights w_A, w_B, w_L and w_R are potentially all different, and they depend on the intensity differences ΔI_A, ΔI_B, ΔI_L and ΔI_R.
  • w_B = LUT(qp, ΔI_B).
  • a LUT implementation cannot read out two different values at once. Hence in this case we would need two LUT instantiations in order to be able to read out w A and w B in parallel. For the same reason, to get all weights we would need four LUT instantiations.
  • the intensity of pixel T is denoted I_T
  • the intensity of pixel S is denoted I_S etc.
  • the filtered pixel T can now be calculated using
  • w_C is a center weight that is constant for the block and therefore does not need to be looked up.
  • embodiments disclosed herein allow the same filtered pixel value to be calculated using only one multiplication per pixel.
  • FIG. 3 illustrates a process 300 of applying bilateral filtering to a media object comprising a plurality of samples. For each sample C in a media object, of the plurality of samples (step 302), it is determined if one or more neighbors of sample C that are above (A), below (B), to the left (L) or to the right (R) of sample C are available (step 304). As explained above, in some embodiments, the sample being filtered will always have access to all its surrounding samples at the time of filtering. However, in other embodiments, this cannot always be the case. As an example, assume we are filtering a block, and the block to the right has not yet been decoded.
  • step 310 may use d_edge(σ_d) when filtering edge pixels and d_corner(σ_d) when filtering corner pixels, whereas step 308 may use d(σ_d).
  • FIG. 4 illustrates a process 400 of applying bilateral filtering to a media object comprising a plurality of samples.
  • the method includes, for a current sample C, computing a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C (step 402).
  • the filtered sample value I_F is given by the equation
  • I_C is the current sample intensity before filtering
  • d(σ_d) = e^(-1/(2σ_d²)) / (1 + e^(-1/(2σ_d²)) + e^(-1/(2σ_d²)) + e^(-1/(2σ_d²)) + e^(-1/(2σ_d²))) = e^(-1/(2σ_d²)) / (1 + 4e^(-1/(2σ_d²)))
  • Computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions (step 404).
  • the lookup table is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.
  • ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.
  • the lookup is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.
  • the lookup table is used to compute an influence function μ_σr(ΔI), where
  • the lookup table is one-dimensional and depends on ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.
  • the lookup table is created based on a fixed value of σ_r (σ_r0) and when computing the filtered sample value I_F for a different value of σ_r (σ_r1), a scaling transform s(σ_r1, ΔI) is applied using a constant c determined from σ_r0 and σ_r1.
  • the filtered sample value I_F is approximated using fixed point numbers.
  • I F is approximated by I F which is given by
  • FIG. 5 is a diagram showing functional units of node 502 (e.g. an encoder/decoder) for applying bilateral filtering to a media object comprising a plurality of samples, according to an embodiment.
  • Node 502 includes a computing unit 504.
  • Computing unit 504 is configured to, for a current sample C, compute a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C,
  • I_C is the current sample intensity before filtering
  • σ_d is a spatial strength parameter
  • σ_r is an intensity strength parameter
  • filtered sample value I_F comprises using a lookup table with two or fewer dimensions.
  • FIG. 6 is a block diagram of node 502 (e.g., an encoder/decoder) for applying bilateral filtering to a media object comprising a plurality of samples, according to some embodiments.
  • node 502 may comprise: processing circuitry (PC) 602, which may include one or more processors (P) 655 (e.g., a general purpose microprocessor and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); a network interface 648 comprising a transmitter (Tx) 645 and a receiver (Rx) 647 for enabling node 502 to transmit data to and receive data from other nodes connected to a network 610 (e.g., an Internet Protocol (IP) network) to which network interface 648 is connected; and a local storage unit (a.k.a., “data storage system”) 608, which may include one or more non-volatile storage devices and/or one or more
  • PC processing circuitry
  • CPP 641 includes a computer readable medium (CRM) 642 storing a computer program (CP) 643 comprising computer readable instructions (CRI) 644.
  • CRM 642 may be a non-transitory computer readable medium, such as, magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like.
  • the CRI 644 of computer program 643 is configured such that when executed by PC 602, the CRI causes node 502 to perform steps described herein (e.g., steps described herein with reference to the flow charts).
  • node 502 may be configured to perform steps described herein without the need for code. That is, for example, PC 602 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.
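
The SIMD constraint mentioned in the list above (a 128-bit register holding sixteen 8-bit entries, with one row in use per block) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the patent's implementation: the names `ROW_QP38`, `lookup_weight`, and `shift` are hypothetical, the row values are copied from the entry listed above for qp 38, and the right shift of the index is assumed from the shifted-lookup form that appears in the list.

```python
# Sixteen 8-bit entries, copied from the row listed above for qp 38.
ROW_QP38 = [254, 237, 192, 135, 82, 43, 20, 8, 3, 1, 0, 0, 0, 0, 0, 0]


def lookup_weight(row, delta_i, shift=0):
    """Return the table entry for an intensity difference.

    The index is the (optionally right-shifted) absolute difference,
    clamped to the last position so large differences stay in range.
    """
    max_pos = len(row) - 1
    index = min(max_pos, abs(delta_i) >> shift)
    return row[index]


# A row of sixteen byte-sized entries occupies exactly one 128-bit register.
row_bits = len(ROW_QP38) * 8
```

For example, `lookup_weight(ROW_QP38, -3)` reads entry 3 of the row, and any difference beyond the last position is clamped to the final (zero) entry.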

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

A method for applying bilateral filtering to a media object comprising a plurality of samples is provided. The method includes, for a current sample C, computing a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C. The filtered sample value I_F is given by (Formula I) where: I_C is the current sample intensity before filtering, ΔI_A, ΔI_B, ΔI_L, and ΔI_R are the differences, respectively, between the current sample intensity I_C and the neighboring samples above, below, to the left, and to the right of the current sample, σ_d and σ_r are strength parameters; and d(σ_d) is given by d(σ_d) = (Formula II), and wherein computing I_F comprises using a lookup table with two or fewer dimensions.

Description

BILATERAL FILTER WITH LUT AVOIDING UNNECESSARY
MULTIPLICATION AND MINIMIZING THE LUT
TECHNICAL FIELD
[001] Disclosed are embodiments related to video compression and filtering.
BACKGROUND
[002] Bilateral filtering of image data directly after forming the reconstructed image block can be beneficial for video compression. As described by Wennersten et al. in “Bilateral Filtering for Video Coding” (referred to as [1] hereafter, and incorporated herein in its entirety), it is possible to reach a bit rate reduction of 0.5% with maintained visual quality for a complexity increase of 3% (encode) and 0% (decode) for random access. However, bilateral filtering involves a division, which can be expensive for hardware implementations. Therefore Wennersten et al. implemented this using a multiplication and a look-up table of 576 bytes [1]. Later, a division-table-free variant of the bilateral filter from [1] was proposed in “Description of SDR, HDR and 360° video coding technology proposal by Qualcomm and Technicolor - low and high complexity versions” (JVET-J0021, ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11) (referred to as [2] hereafter).
[003] The filter weights in a bilateral filter depend on the image data, so they need to be calculated on-the-fly or obtained from a look-up-table (“LUT”). In the implementation in [1], 2,202 bytes were needed for this LUT. Another 576 bytes were needed for the division table, yielding 2,778 bytes in total for the solution in [1]. The implementation proposed in [2] used a LUT of about 33,000 values.
SUMMARY
[004] Even with the division-free implementation, the bilateral filter is costly to implement in some forms of implementation, such as a fully custom ASIC implementation. In order to gain parallelism in such applications, a filter must typically be instantiated several times. This means that even a small look-up table of 2,202 bytes may be costly in terms of silicon area if it is instantiated, say, seven times. The same goes for multipliers within the filter. It is therefore of interest to further reduce the complexity of the filter in terms of LUT size and in terms of expensive operations such as multiplications.
[005] Several aspects of embodiments herein disclosed are now described at a high level. They are further described in detail below. Embodiments may include one or more of these aspects, including all of the aspects together or any other combination.
[006] LUT dimensionality reduction aspect
[007] As formulated in [2], the filter contribution from each surrounding pixel is calculated as a product of three numbers: distance * range * $\Delta I$. In [2], the first of these multiplications is avoided by fetching a pre-multiplied value of distance * range from a three-dimensional look-up table (LUT). This three-dimensional LUT becomes very big, around 33,000 values. Therefore a first aspect is to avoid this pre-multiplication, so that a two-dimensional LUT can be used instead. The two-dimensional LUT is six times smaller. This solution reintroduces a multiplication; in some embodiments, this multiplication can be moved so that it is done only once per filtered pixel, yielding significant savings.
[008] Multiplication removal aspect
[009] A second aspect is to avoid the second multiplication, between range and $\Delta I$. This is done by pre-calculating this product in the LUT instead of performing the multiplication in the filtering operation. Fortunately, this can be done without increasing the dimensionality of the LUT. This means that we can save four multiplications per filtered pixel. When taking into account the multiplication introduced in the first aspect, we therefore only need one multiplication per filtered pixel in total, whereas the solution in [2] may need up to four. This is a substantial reduction.
[0010] LUT row reuse
[0011] The two-dimensional LUT we end up with depends on two parameters, the quantization parameter (qp) and the delta intensity $\Delta I$. Thus one can write the full 2D LUT as a matrix where the different rows represent different qps and the different columns represent different $\Delta I$ values. However, as will be seen in the description below, two given rows in this matrix are quite similar to each other. Therefore, a third aspect is to approximate one row of the matrix using another row and a scaling transform. By doing so, the number of rows actually stored can be dramatically reduced, thereby lowering the size of the LUT by as much as a factor of 20. It should be noted that this way of reusing tables can be applied to the solution in [2] as is, or to the solution in [2] as modified by one or both of the two previously mentioned aspects, and it can also be applied to the solution in [1]. An alternative use of this aspect is to make each row smaller. As an example, if the longest row is 235 bytes, the scaling transformation can be used to get this down to smaller than 16. This makes it possible to implement the LUT using SIMD (single instruction, multiple data) instructions, which typically cannot handle LUTs larger than 16 elements.
[0012] Several additional advantages of one or more of these aspects are now described.
[0013] The bilateral filter is typically placed inside the intra-prediction loop. This means that when creating the intra prediction for the current block, filtered pixels from the previous block may be used. In the decoder, this means that the previous block may have to be fully reconstructed and filtered using the bilateral filter before we can start to construct the prediction for the current block.
[0014] This intra prediction can be in the critical path of the decoder. Inserting a filter into this path increases the latency of the critical path. This in turn means that the clock frequency of the chip may need to be lowered, perhaps to a point where all pixels of a frame cannot be decoded in time. Thus it is of utmost importance to be able to do this filtering as quickly as possible.
[0015] This typically means that we need more than one instantiation of the table. As an example, to filter a single pixel as quickly as possible, we would typically need four instantiations of the LUT (the different instantiations of the LUT are described in a section below). Thus even if the LUT size is only 2,202 bytes, we would need 4*2,202 = 8,808 bytes to filter a single pixel quickly. There are alternative places for the filter, where it is not placed inside the intra-prediction loop. As an example, it can be placed as a loop filter, for instance right before deblocking, after deblocking, or in parallel with sample adaptive offset filtering or in parallel with the adaptive loop filter stage. However, even in this case a single instantiation of the LUT will typically not be sufficient to process all of the pixels in a larger image, such as 4K or 8K resolution. Hence, also in this case several instantiations of the LUT would be necessary.
[0016] However, filtering a single pixel quickly may not be sufficient to lower latency. The next block to the right may read any of the right-most pixels in the current block for its prediction. Therefore, all four pixels in the right-most column may need to be filtered as soon as possible. One way of doing this is to parallelize the filtering. Thankfully the bilateral filter is fully parallelizable. As an example, in a 4x4 block, all four pixels in the right-most column can be filtered in parallel without changing the result of the filtering. However, this comes at the cost of more LUT instantiations. If four LUT instantiations are needed for every filtered pixel, and four pixels need to be filtered simultaneously, then up to 4*4 = 16 instantiations of the LUT may be needed. This becomes 16*2,202 = 35,232 bytes. As is described in more detail below, it is possible to filter four pixels in a column using just 7 instantiations, but 7*2,202 = 15,414 bytes is still quite big and will cost silicon surface area. Also, as described below, seven multipliers would be needed to filter the four pixels when using the method described in [2], which may become troublesome in terms of size. It should again be noted that there are alternative places for the filter, other than inside the intra-prediction loop. As an example, it can be placed as a loop filter, for instance right before deblocking, after deblocking, or in parallel with sample adaptive offset filtering or in parallel with the adaptive loop filter stage. However, even in this case a single instantiation of the filter will typically not be sufficient to process all of the pixels in a larger image, such as 4K or 8K resolution. Hence, also in this case several instantiations of the filter would be necessary. A typical implementation might even in this case need 16 instantiations of the LUT.
[0017] Without the first aspect, implementing [2] with 4-pixel parallelism would require 7*33,000 = 231,000 values to be stored for the LUTs. Using the first aspect in combination with [2], only about 7*33,000/6 = 38,500 values would be needed.
[0018] Without the second aspect, implementing [2] would require seven multipliers to filter four connected pixels in parallel. Using the second aspect an implementation would instead require only four multipliers. This is a substantial reduction. Furthermore, these multipliers may be smaller, i.e., using fewer bits in and out, which also translates to lower silicon area usage.
[0019] With the third aspect, the size of the LUT can be further reduced. Applied to the implementation in [1], the size of an individual LUT may go down from 2,202 bytes to 200 bytes. If seven of these instantiations are needed, the combined LUT space may go down from 7*2,202 = 15,414 to 7*200 = 1,400 bytes. That is a substantial reduction.
[0020] All these reductions will lower the silicon area needed to implement bilateral filtering, saving cost. Furthermore, it is noted that CPU implementations will also benefit by not needing to do a multiplication.
[0021] As noted above, it is possible to put the bilateral filter outside the intra-prediction loop, but inside the inter-prediction loop. (While this can lower the performance of the filter, it can still be beneficial to place it there due to latency requirements.) That means that when a block predicts from a previous block in the same image, it will use unfiltered data. This will put the filter outside the critical path for intra coding. This will make the complexity of the bilateral filter less critical, since only one instance of the filter may be needed, instead of seven or four. However, even in this case, it is of great benefit to have a low-complexity filter, since this will translate to less silicon area. When predicting from a previously decoded block from a different image, filtered data will be used, putting the filter inside the inter-prediction loop.
[0022] According to an embodiment, a method for applying bilateral filtering to a media object comprising a plurality of samples is provided. The method includes, for a current sample C, computing a filtered sample value $I_F$ based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C. The filtered sample value $I_F$ is given by the equation
$$I_F = I_C + d(\sigma_d) \left( e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \right)$$
where:
$I_C$ is the current sample intensity before filtering;
$\Delta I_A$ is the difference between the current sample intensity $I_C$ and the intensity of the sample above ($I_A$), such that $\Delta I_A = I_A - I_C$;
$\Delta I_B$ is the difference between the current sample intensity $I_C$ and the intensity of the sample below ($I_B$), such that $\Delta I_B = I_B - I_C$;
$\Delta I_L$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the left ($I_L$), such that $\Delta I_L = I_L - I_C$;
$\Delta I_R$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the right ($I_R$), such that $\Delta I_R = I_R - I_C$;
$\sigma_d$ is a spatial strength parameter;
$\sigma_r$ is an intensity strength parameter; and
$d(\sigma_d)$ is given by
$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}},$$
wherein computing the filtered sample value $I_F$ comprises using a lookup table with two or fewer dimensions.
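[0023] In some embodiments, the lookup table is two-dimensional and depends on $\sigma_r$ and $\Delta I$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$. In some embodiments, the lookup table is used to determine $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$. In some embodiments, the lookup table is used to compute an influence function $m_{\sigma_r}(\Delta I)$, where $m_{\sigma_r}(\Delta I) = e^{-\frac{\Delta I^2}{2\sigma_r^2}} \Delta I$, such that the filtered sample value $I_F$ is given by the equation
$$I_F = I_C + d(\sigma_d) * \left( m_{\sigma_r}(\Delta I_A) + m_{\sigma_r}(\Delta I_B) + m_{\sigma_r}(\Delta I_L) + m_{\sigma_r}(\Delta I_R) \right),$$
wherein computing the filtered sample value $I_F$ involves one and only one multiplication per current sample C.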
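[0024] In some embodiments, the lookup table is one-dimensional and depends on $\Delta I$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$. In some embodiments, the lookup table is created based on a fixed value of $\sigma_r$ ($\sigma_{r0}$) and, when computing the filtered sample value $I_F$ for a different value of $\sigma_r$ ($\sigma_{r1}$), a scaling transform $s(\sigma_{r1}, \Delta I)$ is applied using a constant $c$ determined from $\sigma_{r0}$ and $\sigma_{r1}$. In some embodiments, the filtered sample value $I_F$ is approximated using fixed-point numbers. In some embodiments, $I_F$ is approximated by $\tilde{I}_F$, which is given by
$$\tilde{I}_F = I_C + \left( \left( \mathrm{round}\left(2^6 d(\sigma_d)\right) * \left( \mathrm{round}\left(2^2 m_{\sigma_r}(\Delta I_A)\right) + \mathrm{round}\left(2^2 m_{\sigma_r}(\Delta I_B)\right) + \mathrm{round}\left(2^2 m_{\sigma_r}(\Delta I_L)\right) + \mathrm{round}\left(2^2 m_{\sigma_r}(\Delta I_R)\right) \right) + 2^7 \right) \gg 8 \right),$$
where $\gg$ denotes arithmetic right shift and round(·) rounds to the nearest integer.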
[0025] In some embodiments, using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations. In some embodiments, a size of a row of the lookup table is no more than 128 bits.
[0026] According to another embodiment, an encoder for applying bilateral filtering to a media object comprising a plurality of samples is provided. The encoder includes a computing unit configured to, for a current sample C, compute a filtered sample value $I_F$ based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C, wherein the filtered sample value $I_F$ is given by the equation
$$I_F = I_C + d(\sigma_d) \left( e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \right)$$
where:
$I_C$ is the current sample intensity before filtering;
$\Delta I_A$ is the difference between the current sample intensity $I_C$ and the intensity of the sample above ($I_A$), such that $\Delta I_A = I_A - I_C$;
$\Delta I_B$ is the difference between the current sample intensity $I_C$ and the intensity of the sample below ($I_B$), such that $\Delta I_B = I_B - I_C$;
$\Delta I_L$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the left ($I_L$), such that $\Delta I_L = I_L - I_C$;
$\Delta I_R$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the right ($I_R$), such that $\Delta I_R = I_R - I_C$;
$\sigma_d$ is a spatial strength parameter;
$\sigma_r$ is an intensity strength parameter; and
$d(\sigma_d)$ is given by
$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}},$$
and wherein computing the filtered sample value $I_F$ comprises using a lookup table with two or fewer dimensions.
[0027] According to another embodiment, a decoder for applying bilateral filtering to a media object comprising a plurality of samples is provided. The decoder includes a computing unit configured to, for a current sample C, compute a filtered sample value $I_F$ based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C, wherein the filtered sample value $I_F$ is given by the equation
$$I_F = I_C + d(\sigma_d) \left( e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \right)$$
where:
$I_C$ is the current sample intensity before filtering;
$\Delta I_A$ is the difference between the current sample intensity $I_C$ and the intensity of the sample above ($I_A$), such that $\Delta I_A = I_A - I_C$;
$\Delta I_B$ is the difference between the current sample intensity $I_C$ and the intensity of the sample below ($I_B$), such that $\Delta I_B = I_B - I_C$;
$\Delta I_L$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the left ($I_L$), such that $\Delta I_L = I_L - I_C$;
$\Delta I_R$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the right ($I_R$), such that $\Delta I_R = I_R - I_C$;
$\sigma_d$ is a spatial strength parameter;
$\sigma_r$ is an intensity strength parameter; and
$d(\sigma_d)$ is given by
$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}},$$
and wherein computing the filtered sample value $I_F$ comprises using a lookup table with two or fewer dimensions.
[0028] According to another embodiment, a computer program is provided, comprising instructions which, when executed by processing circuitry of a node, cause the node to perform the method of any one of the embodiments disclosed herein.
[0029] According to another embodiment, a carrier containing the computer program of embodiments is provided, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
BRIEF DESCRIPTION OF THE DRAWINGS
[0030] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
[0031] FIG. 1 illustrates a plot for various quantization parameter (qp) values according to one embodiment.
[0032] FIG. 2 illustrates a plot according to one embodiment.
[0033] FIG. 3 is a flow chart illustrating a process according to one embodiment.
[0034] FIG. 4 is a flow chart illustrating a process according to one embodiment.
[0035] FIG. 5 is a diagram showing functional units of a node according to one embodiment.
[0036] FIG. 6 is a block diagram of a node according to one embodiment.
DETAILED DESCRIPTION
[0037] Throughout this description we will use filtering of intensity values as an example. This traditionally refers to the Y in YCbCr. However, it should be noted that this filtering can also be used for chroma values such as Cb and Cr, or any other components from other color spaces such as ICtCp, Lab, Y'u'v', etc.
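[0038] The original filter as described in [1] filters a pixel using
$$I_F = I_C + \frac{w_A * \Delta I_A + w_B * \Delta I_B + w_L * \Delta I_L + w_R * \Delta I_R}{1 + w_A + w_B + w_L + w_R}$$
(Eqn 1)
where $I_F$ is the filtered pixel intensity and $I_C$ is the center pixel intensity, i.e., the pixel intensity before filtering. The value $\Delta I_A$ is the difference between the center pixel intensity $I_C$ and the intensity of the pixel above, $I_A$: $\Delta I_A = I_A - I_C$. Analogously, $I_B$, $I_L$ and $I_R$ are the intensities of the pixels immediately below, left and right of the center pixel respectively, and $\Delta I_B = I_B - I_C$, $\Delta I_L = I_L - I_C$, and $\Delta I_R = I_R - I_C$.
[0039] In [1], the weights are calculated as
$$w_X = e^{-\frac{1}{2\sigma_d^2} - \frac{\Delta I_X^2}{2\sigma_r^2}},$$
where X can be A, B, L or R. This is equal to
$$w_X = e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}}.$$
Thus Equation 1 can be written as
$$I_F = I_C + \frac{e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R}{1 + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_R^2}{2\sigma_r^2}}}$$
(Eqn 2)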
[0040] During the filtering of a block in [1], the variables $\sigma_d$ and $\sigma_r$ are kept constant. However, the intensity differences $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and $\Delta I_R$ change with every pixel, since they depend on the intensity of the center pixel $I_C$ and the intensities of the surrounding pixels ($I_A$, $I_B$, $I_L$, and $I_R$). This means that the denominator in Equation 1 will be different for every pixel. To filter the pixel, a division is needed, and that can be implemented using a division table as described in [1].
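[0041] In order to avoid this division, the filter is reformulated in [2] by simply removing the factors containing $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$ from the denominator (but not from the numerator) of Equation 2. This results in
$$I_F = I_C + \frac{e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}$$
(Eqn 3)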
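[0042] Since $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$ is smaller than 1 for any nonzero $\Delta I$, each term in the denominator will become a bit larger, and the larger denominator in Equation 3 will thus give a weaker filtering (less deviation from the original pixel $I_C$) than Equation 2 would give for the same parameters $\sigma_d$ and $\sigma_r$. By letting
$$d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}$$
(Eqn 4)
it is possible to rewrite Equation 3 as
$$I_F = I_C + d(\sigma_d) e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + d(\sigma_d) e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + d(\sigma_d) e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + d(\sigma_d) e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R$$
(Eqn 5)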
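The effect of dropping the range factors from the denominator can be illustrated numerically. The following is a minimal floating-point sketch, not the implementation of [1] or [2]; the function names and the sample values ($\sigma_d = 0.82$, $\sigma_r = 10$, the pixel intensities) are hypothetical choices made only for illustration.

```python
import math

def spatial_weight(sigma_d):
    # e^(-1/(2*sigma_d^2)): weight of a neighbour at distance 1 from the centre
    return math.exp(-1.0 / (2.0 * sigma_d ** 2))

def range_weight(delta_i, sigma_r):
    # e^(-delta_i^2/(2*sigma_r^2)): shrinks as the intensity difference grows
    return math.exp(-(delta_i ** 2) / (2.0 * sigma_r ** 2))

def filter_eq2(i_c, neighbours, sigma_d, sigma_r):
    """Equation 2: the denominator depends on the pixel data (needs a division)."""
    s = spatial_weight(sigma_d)
    deltas = [n - i_c for n in neighbours]
    weights = [s * range_weight(d, sigma_r) for d in deltas]
    numerator = sum(w * d for w, d in zip(weights, deltas))
    return i_c + numerator / (1.0 + sum(weights))

def filter_eq5(i_c, neighbours, sigma_d, sigma_r):
    """Equations 3 to 5: the denominator is folded into the constant d(sigma_d)."""
    s = spatial_weight(sigma_d)
    d = s / (1.0 + 4.0 * s)  # Equation 4; depends only on sigma_d
    return i_c + sum(d * range_weight(n - i_c, sigma_r) * (n - i_c)
                     for n in neighbours)

if __name__ == "__main__":
    centre, around = 100, [110, 104, 96, 108]  # hypothetical A, B, L, R
    print("Equation 2:", filter_eq2(centre, around, 0.82, 10.0))
    print("Equation 5:", filter_eq5(centre, around, 0.82, 10.0))
```

Because both variants share the same numerator and the denominator of Equation 3 is never smaller than that of Equation 2, the second result stays closer to the center intensity, which is the weaker filtering described above.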
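[0043] Note that the variable $d(\sigma_d)$ is denoted $Distance_k$ in [2]. Also, $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$ is denoted $Range_k$ in [2]. By further defining
$$W_{\sigma_d,\sigma_r}(\Delta I) = d(\sigma_d) e^{-\frac{\Delta I^2}{2\sigma_r^2}},$$
Equation 5 is further simplified in [2] to
$$I_F = I_C + W_{\sigma_d,\sigma_r}(\Delta I_A) * \Delta I_A + W_{\sigma_d,\sigma_r}(\Delta I_B) * \Delta I_B + W_{\sigma_d,\sigma_r}(\Delta I_L) * \Delta I_L + W_{\sigma_d,\sigma_r}(\Delta I_R) * \Delta I_R$$
(Eqn 6)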
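[0044] Please note that upper case $W$ is different from lower case $w$ here, as
$$w_X = e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}}$$
but
$$W_{\sigma_d,\sigma_r}(\Delta I_X) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}} e^{-\frac{\Delta I_X^2}{2\sigma_r^2}}.$$
[0045] In [2], Equation 6 is written using a summation symbol, which in our notation becomes
$$I_F = I_C + \sum_{X \in \{A, B, L, R\}} W_{\sigma_d,\sigma_r}(\Delta I_X) * \Delta I_X$$
(Eqn 9)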
[0046] There are two limitations associated with implementing the filter according to Equation 6 or 9. First, it is noted that the (upper case) weight $W_{\sigma_d,\sigma_r}(\Delta I_X)$ depends on three variables: $\sigma_d$, $\sigma_r$, and $\Delta I_X$. In the implementation in [2], $\sigma_d$ can take on six different values, $\sigma_r$ can take on 34 different values and $\Delta I_X$ up to 1,023 different values. Thus if the weight $W_{\sigma_d,\sigma_r}(\Delta I_X)$ is stored as a look-up table (LUT), up to 6*34*1,023 = 208,692 values need to be stored. While this may not be much of a problem for a CPU-based implementation, it can be a serious problem for an implementation in full custom ASIC, where such a large LUT will translate to a sizable part of the available silicon area.
[0047] Typically, many of these values are small enough to be rounded to zero for a fixed precision, and they need not be tabulated. However, even when doing so, up to 33,000 values may be needed for the look-up table. This is in contrast to the implementation in [1], where only around 2,800 bytes are needed for the LUT and the division table. On the face of it, therefore, the approximation used in [2] that avoids the division does not bring much in terms of savings. Instead, it looks like an increase in complexity, especially when it comes to the size of the LUT.
[0048] The second limitation with the implementation according to Equation 9 is that it mandates a multiplication inside the summation; the (upper case) weight $W_{\sigma_d,\sigma_r}(\Delta I_X)$ is multiplied with the intensity difference $\Delta I_X$. Again, this may not be much of an issue for a CPU-based implementation, but for a full custom ASIC type implementation this can be expensive, especially if the filter needs to be instantiated several times to achieve parallelism.
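[0049] Therefore, in one embodiment, we instead use a different implementation, which is now described in detail. We go back to Equation 5, which is hereby repeated for the convenience of the reader,
$$I_F = I_C + d(\sigma_d) e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + d(\sigma_d) e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + d(\sigma_d) e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + d(\sigma_d) e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R$$
(Eqn 5)
[0050] An important first aspect of some embodiments is that the factor $d(\sigma_d)$ is now extracted from the four terms,
$$I_F = I_C + d(\sigma_d) * \left( e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L + e^{-\frac{\Delta I_R^2}{2\sigma_r^2}} \Delta I_R \right)$$
(Eqn 10)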
[0051] An important second aspect of some embodiments is that we now define an influence function $m_{\sigma_r}(\Delta I)$ that depends on $\sigma_r$ and $\Delta I$ as
$$m_{\sigma_r}(\Delta I) = e^{-\frac{\Delta I^2}{2\sigma_r^2}} \Delta I$$
(Eqn 11)
[0052] By using Equation 11 we can now rewrite the bracketed expression in Equation 10 as a sum of these influence terms. Equation 10 then simplifies to
$$I_F = I_C + d(\sigma_d) * \left( m_{\sigma_r}(\Delta I_A) + m_{\sigma_r}(\Delta I_B) + m_{\sigma_r}(\Delta I_L) + m_{\sigma_r}(\Delta I_R) \right)$$
(Eqn 12)
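The structure of Equation 12 (four table fetches, three additions and a single multiplication) can be sketched as follows. This is an illustrative floating-point sketch only; the helper names are hypothetical, and storing only non-negative $\Delta I$ by exploiting $m_{\sigma_r}(-\Delta I) = -m_{\sigma_r}(\Delta I)$ is an assumption of this sketch rather than something stated in the text.

```python
import math

def d_factor(sigma_d):
    # d(sigma_d) from Equation 4
    s = math.exp(-1.0 / (2.0 * sigma_d ** 2))
    return s / (1.0 + 4.0 * s)

def build_influence_row(sigma_r, max_delta):
    """Tabulate m(dI) = exp(-dI^2 / (2*sigma_r^2)) * dI for dI = 0..max_delta."""
    return [math.exp(-(d * d) / (2.0 * sigma_r ** 2)) * d
            for d in range(max_delta + 1)]

def influence(row, delta_i):
    # m is an odd function, so the sign of delta_i is applied after the fetch
    # (assumption of this sketch: only non-negative dI are tabulated).
    value = row[abs(delta_i)]
    return value if delta_i >= 0 else -value

def filter_pixel(i_c, i_a, i_b, i_l, i_r, row, d):
    """Equation 12: table fetches and additions, then one multiplication."""
    acc = sum(influence(row, n - i_c) for n in (i_a, i_b, i_l, i_r))
    return i_c + d * acc  # the only multiplication per filtered pixel

if __name__ == "__main__":
    row = build_influence_row(sigma_r=10.0, max_delta=1023)  # one LUT row
    print(filter_pixel(100, 110, 104, 96, 108, row, d_factor(0.82)))
```

The row depends only on $\sigma_r$ and $\Delta I$, so one two-dimensional table (one row per $\sigma_r$) serves all values of $\sigma_d$.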
[0053] We can now see that we have accomplished two things. The first is that we no longer need the upper case weight $W_{\sigma_d,\sigma_r}(\Delta I_X)$, which depends on three variables and hence needs to be tabulated as a three-dimensional LUT. Instead we have separated out $d(\sigma_d)$ from the other terms, removing the need for the LUT to depend on the $\sigma_d$ variable. Hence the LUT can depend on two variables, $\sigma_r$ and $\Delta I$. The second is that we do not store $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$ in the LUT, but instead we tabulate the influence function $m_{\sigma_r}(\Delta I) = e^{-\frac{\Delta I^2}{2\sigma_r^2}} \Delta I$, where the multiplication with $\Delta I$ has already taken place.
[0054] As for the first aspect, when filtering according to Equation 9 as in [2], up to 6*34*1,023 = 208,692 values need to be stored. But when filtering according to the second aspect of embodiments using Equation 12, a maximum of 6 + 34*1,023 = 34,788 values need to be stored.
[0055] Above we saw that the implementations can save LUT space by storing only non-zero values. This way the implementation in [2] can also reach about 33,000 values. However, the same thing can be done here. A relevant accuracy for the influence value may be to use two fractional bits. In this case, the value will on average be zero for values of $\Delta I$ larger than 132. Thus about 133*34 = 4,522 values need to be stored. This is smaller than the 33,000 values needed using the implementation in [2] by a factor of 33,000/4,522 ≈ 7.
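The storage counts quoted in the preceding paragraphs can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text (six $\sigma_d$ values, 34 $\sigma_r$ values, 1,023 intensity differences, an average of 133 stored entries per row); the variable names are hypothetical.

```python
N_SIGMA_D = 6      # number of sigma_d values in [2], as stated in the text
N_SIGMA_R = 34     # number of sigma_r values
N_DELTA_I = 1023   # number of intensity differences

# Three-dimensional table of W as in [2] (Equation 9)
full_3d = N_SIGMA_D * N_SIGMA_R * N_DELTA_I

# d(sigma_d) values plus a two-dimensional influence table (Equation 12)
separated = N_SIGMA_D + N_SIGMA_R * N_DELTA_I

# Two-dimensional table when only dI = 0..132 is kept per row on average
AVG_ROW_LENGTH = 133
truncated_2d = AVG_ROW_LENGTH * N_SIGMA_R

TRUNCATED_3D = 33_000  # figure quoted in the text for [2]

if __name__ == "__main__":
    print(full_3d, separated, truncated_2d)
    print("reduction factor:", TRUNCATED_3D / truncated_2d)
```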
[0056] The second thing that we have accomplished is that we have removed the need for several multiplications per filtered pixel, as seen in Equation 12. Due to this second aspect we can avoid the four multiplications marked by '*' in Equation 6 used in [2]. Comparing instead with [1], we can avoid the four multiplications marked by '*' in Equation 1. In both cases, this second aspect saves multiplications. Instead one multiplication per filtered pixel is introduced (the multiplication marked by '*' in Equation 12). Thus, instead of four multiplications per pixel, we are down to only one multiplication per pixel. This is a substantial reduction.
[0057] As is explained below, some computation can be shared between pixels. Thus on average, instead of four multiplications per filtered pixel, implementations of [1] and [2] would need about two multiplications per filtered pixel, if the techniques described below about sharing computations are used. However, using the second aspect of embodiments, only one multiplication per filtered pixel is needed. This is still a substantial reduction over the modified versions of [1] and [2].
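[0058] It should be noted that this is not merely a question of implementation. In order to avoid decoder drift, it is essential that the encoder and the decoder get exactly the same result during filtering. Therefore it is not sufficient to state an approximate value of the weights when defining the video coding standard. These values must be defined exactly down to the last bit. Also operations such as rounding must be exactly defined. Therefore, a video coding standard using bilateral filtering inside any prediction loop (be it intra or inter prediction) must not only define the values of the weights used, but also the precision and where and how the rounding happens. As an example, in [2] the value of the upper case weight $W_{\sigma_d,\sigma_r}(\Delta I)$ will be represented as a fixed-point number. If 8 fractional bits are used, the actual value stored in the look-up table will be the integer
$$\hat{W}_{\sigma_d,\sigma_r}(\Delta I) = \mathrm{round}\left(2^8 \, W_{\sigma_d,\sigma_r}(\Delta I)\right),$$
where round(·) rounds to the nearest integer. The filtered pixel will then be calculated as
$$\hat{I}_F = I_C + \left( \left( \hat{W}_{\sigma_d,\sigma_r}(\Delta I_A) * \Delta I_A + \hat{W}_{\sigma_d,\sigma_r}(\Delta I_B) * \Delta I_B + \hat{W}_{\sigma_d,\sigma_r}(\Delta I_L) * \Delta I_L + \hat{W}_{\sigma_d,\sigma_r}(\Delta I_R) * \Delta I_R + 2^7 \right) \gg 8 \right),$$
where $\gg$ denotes arithmetic right shift, and the term $2^7$ is used to round evenly. However, in an embodiment, the filtered pixel is instead calculated as
$$\tilde{I}_F = I_C + \left( \left( \mathrm{round}\left(2^6 d(\sigma_d)\right) * \left( \mathrm{round}\left(2^2 m_{\sigma_r}(\Delta I_A)\right) + \mathrm{round}\left(2^2 m_{\sigma_r}(\Delta I_B)\right) + \mathrm{round}\left(2^2 m_{\sigma_r}(\Delta I_L)\right) + \mathrm{round}\left(2^2 m_{\sigma_r}(\Delta I_R)\right) \right) + 2^7 \right) \gg 8 \right).$$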
[0059] Here we have used six fractional bits to represent $d(\sigma_d)$ and two fractional bits to represent $m_{\sigma_r}(\Delta I)$, which is a realistic precision. Note that both $\hat{I}_F$ and $\tilde{I}_F$ are approximations of $I_F$, but crucially they are different approximations, since the components have been rounded differently. If the encoder uses $\hat{I}_F$ and the decoder uses $\tilde{I}_F$, there will be decoder drift and undefined behavior. Hence the same formula must be used in both cases, and it must be defined in the standard.
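To make the rounding discussion concrete, the sketch below computes both fixed-point approximations next to a floating-point reference. It is an illustration under stated assumptions, not a normative definition: the exact formulas (a `2^7` rounding offset followed by a right shift of 8 in both variants) are inferred from the precisions given in the text, the table fetches are replaced by on-the-fly evaluation, the function names are hypothetical, and Python's built-in `round` resolves exact ties to the even integer, which a standard would have to pin down explicitly.

```python
import math

D_BITS = 6  # fractional bits for d(sigma_d), as stated in the text
M_BITS = 2  # fractional bits for the influence value m
W_BITS = 8  # fractional bits for the weight W in the prior-art variant

def d_factor(sigma_d):
    s = math.exp(-1.0 / (2.0 * sigma_d ** 2))
    return s / (1.0 + 4.0 * s)

def range_weight(delta_i, sigma_r):
    return math.exp(-(delta_i * delta_i) / (2.0 * sigma_r ** 2))

def float_reference(i_c, neighbours, sigma_d, sigma_r):
    d = d_factor(sigma_d)
    return i_c + d * sum(range_weight(n - i_c, sigma_r) * (n - i_c)
                         for n in neighbours)

def fixed_prior_art(i_c, neighbours, sigma_d, sigma_r):
    """Weights with 8 fractional bits, one multiplication per neighbour."""
    d = d_factor(sigma_d)
    acc = 0
    for n in neighbours:
        delta = n - i_c
        w_fix = round(d * range_weight(delta, sigma_r) * (1 << W_BITS))
        acc += w_fix * delta
    return i_c + ((acc + (1 << (W_BITS - 1))) >> W_BITS)

def fixed_embodiment(i_c, neighbours, sigma_d, sigma_r):
    """6-bit d and 2-bit influence values, one multiplication per pixel."""
    d_fix = round(d_factor(sigma_d) * (1 << D_BITS))
    acc = 0
    for n in neighbours:
        delta = n - i_c
        m = range_weight(delta, sigma_r) * delta
        acc += round(m * (1 << M_BITS))  # a LUT fetch in a real design
    shift = D_BITS + M_BITS
    return i_c + ((d_fix * acc + (1 << (shift - 1))) >> shift)

if __name__ == "__main__":
    args = (100, [110, 104, 96, 108], 0.82, 10.0)  # hypothetical sample
    print(float_reference(*args), fixed_prior_art(*args), fixed_embodiment(*args))
```

Both integer variants stay close to the floating-point value, but since their intermediate roundings differ they are not guaranteed to agree for every input, which is why the exact formula has to be fixed.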
[0060] In some embodiments, the pixel being filtered will always have access to all its surrounding pixels at the time of filtering. However, in an embodiment where the filter is inside the intra prediction loop, as is the case for both [1] and [2], this cannot always be the case. As an example, assume we are filtering a block, and the block to the right has not yet been decoded. This means that a pixel situated on the right edge of the block will not have access to its right neighbor, since this neighbor belongs to the not-yet-decoded block. In this case the filtering will have to do with fewer surrounding pixels.
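[0061] The solution to this situation used in [2] is to simply exclude this term in the calculation. Thus Equation 6 is changed to
$$I_F = I_C + W_{\sigma_d,\sigma_r}(\Delta I_A) * \Delta I_A + W_{\sigma_d,\sigma_r}(\Delta I_B) * \Delta I_B + W_{\sigma_d,\sigma_r}(\Delta I_L) * \Delta I_L$$
(Eqn 17)
in which the last term of Equation 6 has been removed.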
[0062] The solution to this situation in [1] is different. Here Equation 1 is still used, but $w_R$ is set to zero, which gives
$$I_F = I_C + \frac{w_A \Delta I_A + w_B \Delta I_B + w_L \Delta I_L}{1 + w_A + w_B + w_L}$$
(Eqn 18)
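[0063] Since the denominator now becomes smaller, the filter in [1] compensates for the lack of information in pixel R by trusting the remaining pixels A, B and L more. This should give a better filtering, and is different from what happens in Equation 17 used in [2], which will simply filter such a pixel less strongly. Continuing with Equation 18, we can follow the same steps of approximation as we did in Equations 2 and 3 and get
$$I_F = I_C + \frac{e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_A^2}{2\sigma_r^2}} \Delta I_A + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_B^2}{2\sigma_r^2}} \Delta I_B + e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{\Delta I_L^2}{2\sigma_r^2}} \Delta I_L}{1 + 3e^{-\frac{1}{2\sigma_d^2}}}$$
[0064] In this case, this gives a different value of $d(\sigma_d)$,
$$d_{edge}(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 3e^{-\frac{1}{2\sigma_d^2}}},$$
where we have a 3 in the denominator instead of a 4. Analogously, for corner pixels, where two surrounding pixels are missing, and we thus only have two neighbors, the corresponding value would be
$$d_{corner}(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 2e^{-\frac{1}{2\sigma_d^2}}}.$$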
Thus when filtering, we should ideally use $d_{edge}(\sigma_d)$ when filtering edge pixels, $d_{corner}(\sigma_d)$ when filtering corner pixels and $d(\sigma_d)$ otherwise. Doing this also gives a BD rate reduction (reduction in bit rate for the same quality) as compared to always using $d(\sigma_d)$, as is done in [2]. However, if one would like to implement this using the prior art scheme from [2], this would be very expensive. The value stored in the look-up table in [2] is
$$W_{\sigma_d,\sigma_r}(\Delta I) = d(\sigma_d) e^{-\frac{\Delta I^2}{2\sigma_r^2}},$$
and this would mean that we would need to create two more such LUTs,
$$W^{edge}_{\sigma_d,\sigma_r}(\Delta I) = d_{edge}(\sigma_d) e^{-\frac{\Delta I^2}{2\sigma_r^2}} \quad \text{and} \quad W^{corner}_{\sigma_d,\sigma_r}(\Delta I) = d_{corner}(\sigma_d) e^{-\frac{\Delta I^2}{2\sigma_r^2}}.$$
Thus the total size of all the LUTs would increase by a factor of three. In the prior art, the LUT size would therefore go up from 6*34*1,023 = 208,692 to 3*6*34*1,023 = 626,076 values, or approximately 33,000*3 = 99,000 values if values close to zero are omitted.
[0065] In sharp contrast, accommodating this more accurate filtering using embodiments herein disclosed results in no extra cost for the LUT. Equation 12 can be used for pixels that are neither border pixels nor corner pixels, whereas for border pixels we can use
$$I_F = I_C + d_{edge}(\sigma_d) * \sum_{X} m_{\sigma_r}(\Delta I_X),$$
where the sum runs over the three available neighbors, and for corner pixels we can use
$$I_F = I_C + d_{corner}(\sigma_d) * \sum_{X} m_{\sigma_r}(\Delta I_X),$$
where the sum runs over the two available neighbors. The LUT for the influence values remains the same, and only the $d(\sigma_d)$ values must come in three versions, increasing their total number from six to 18. Thus the cost goes from 6 + 34*1,023 = 34,788 values to 18 + 34*1,023 = 34,800 values, a negligible change.
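The handling of missing neighbors described above can be sketched by letting the scale factor depend on the number of available neighbors (4 in the interior, 3 on an edge, 2 in a corner). This is a hypothetical floating-point sketch; the function names and sample values are illustrative assumptions, and the influence values are computed directly instead of being fetched from a LUT.

```python
import math

def d_factor(sigma_d, n_neighbours=4):
    """d (4 neighbours), d_edge (3) or d_corner (2) for a given sigma_d."""
    s = math.exp(-1.0 / (2.0 * sigma_d ** 2))
    return s / (1.0 + n_neighbours * s)

def influence(delta_i, sigma_r):
    # m(dI); a real design would fetch this from the shared 2D LUT
    return math.exp(-(delta_i * delta_i) / (2.0 * sigma_r ** 2)) * delta_i

def filter_pixel(i_c, available_neighbours, sigma_d, sigma_r):
    """Sum the influence of the available neighbours, then scale once."""
    d = d_factor(sigma_d, len(available_neighbours))
    acc = sum(influence(n - i_c, sigma_r) for n in available_neighbours)
    return i_c + d * acc

# Storage comparison using the counts from the text
PRIOR_ART_THREE_LUTS = 3 * 6 * 34 * 1023   # three 3D tables as in [2]
EMBODIMENT_BEFORE = 6 + 34 * 1023          # six d values + influence LUT
EMBODIMENT_AFTER = 18 + 34 * 1023          # 18 d values + same influence LUT

if __name__ == "__main__":
    print(filter_pixel(100, [110, 104, 96, 108], 0.82, 10.0))  # interior
    print(filter_pixel(100, [110, 104, 96], 0.82, 10.0))       # right edge
    print(filter_pixel(100, [104, 96], 0.82, 10.0))            # corner
    print(PRIOR_ART_THREE_LUTS, EMBODIMENT_BEFORE, EMBODIMENT_AFTER)
```

Only the small set of scale factors grows; the influence table is shared by all three cases.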
[0066] In another embodiment, we do not make use of the second aspect, i.e., we do not bake the multiplication into the LUT. Instead of storing the value $m_{\sigma_r}(\Delta I)$ in the LUT, we store only the first part $r_{\sigma_r}(\Delta I)$ (without multiplication by $\Delta I$), given by
$$r_{\sigma_r}(\Delta I) = e^{-\frac{\Delta I^2}{2\sigma_r^2}}.$$
The filtered pixel is then calculated as
$$I_F = I_C + d(\sigma_d) * \left( r_{\sigma_r}(\Delta I_A) * \Delta I_A + r_{\sigma_r}(\Delta I_B) * \Delta I_B + r_{\sigma_r}(\Delta I_L) * \Delta I_L + r_{\sigma_r}(\Delta I_R) * \Delta I_R \right)$$
(Eqn 22)
where we have highlighted the multiplications using '*'. In this case we will not get the benefit of having fewer multiplications, but we can still benefit from the first aspect, namely a smaller LUT size. This is due to the fact that $r_{\sigma_r}(\Delta I)$ only depends on two variables ($\sigma_r$ and $\Delta I$) and is therefore a 2D LUT of the same size as the LUT for $m_{\sigma_r}(\Delta I)$, and hence considerably smaller than the 3D LUT used for storing $W_{\sigma_d,\sigma_r}(\Delta I)$.
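[0067] For the third aspect we will first consider the original filter as described in [1]. It filters a pixel using Equation 1, repeated here for the convenience of the reader.
$$I_F = I_C + \frac{w_A * \Delta I_A + w_B * \Delta I_B + w_L * \Delta I_L + w_R * \Delta I_R}{1 + w_A + w_B + w_L + w_R}$$
(Eqn 1)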
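[0068] In [1], the weights are calculated as
$$w_X = e^{-\frac{1}{2\sigma_d^2} - \frac{|\Delta I_X|^2}{2\sigma_r^2}},$$
where X can be A, B, L or R. This is equal to
$$w_X = w(\sigma_d, \sigma_r, \Delta I_X) = e^{-\frac{1}{2\sigma_d^2}} e^{-\frac{|\Delta I_X|^2}{2\sigma_r^2}}.$$
Thus Equation 1 above can be written as
$$I_F = I_C + \frac{\sum_{X \in \{A,B,L,R\}} w(\sigma_d, \sigma_r, \Delta I_X) * \Delta I_X}{1 + \sum_{X \in \{A,B,L,R\}} w(\sigma_d, \sigma_r, \Delta I_X)}$$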
[0069] It is noted in [1] that by changing the 1 in the denominator, it is possible to get the same result as if $\sigma_d$ had instead been changed. It is thus sufficient to tabulate $w(\sigma_d, \sigma_r, \Delta I)$ for a given $\sigma_d$ such as $\sigma_d = 0.82$ and replace the 1 in the denominator when another $\sigma_d$ is needed.
[0070] Since $\sigma_d$ can be held constant in the LUT, it can now be made a two-dimensional $LUT(\sigma_r, \Delta I) = w(0.82, \sigma_r, \Delta I)$, i.e., it only depends on the variables $\Delta I$ and $\sigma_r$. The value of $\sigma_r$ is calculated directly from the qp value according to $\sigma_r = 2 * (qp - 17)$ for 10-bit data or $\sigma_r = \frac{1}{2}(qp - 17)$ for 8-bit data. Therefore we can equivalently say that the look-up table is indexed using $\Delta I$ and qp: $w = LUT(qp, \Delta I)$. Also, since all values in the LUT are smaller than 1, we need fractional resolution when representing the LUT using integers. In [1] 65 represents 1.0, which means that the following formula is used to calculate the LUT
$$LUT(qp, \Delta I) = \mathrm{round}\left( 65 \, e^{-\frac{1}{2 * 0.82^2}} e^{-\frac{|\Delta I|^2}{2 * 2^2 * (qp - 17)^2}} \right)$$
(Eqn 24)
where round(·) rounds to the nearest integer. For 10-bit values, the difference between two intensities can range from 0-1023 = -1023 to 1023-0 = 1023. However, since the formula contains a square, it is always true that $LUT(qp, -\Delta I) = LUT(qp, \Delta I)$, so only non-negative values of $\Delta I$ need to be tabulated.
[0071] As an example, for the lowest qp allowed, qp = 18, the values of the LUT for $\Delta I = 0 \ldots 1023$ are $LUT(18, \Delta I)$ = 31, 27, 19, 10, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0.
[0072] It is noted in [1] that it is sufficient to store the first zero. This lowers the number of integers that need to be stored significantly. However, there is still a large number of integers to be stored. As an example, for qp = 34 we have $LUT(34, \Delta I)$ = 31, 31, 31, 31, 31,
31, 30, 30, 30, 30, 30, 29, 29, 29, 28, 28, 28, 27, 27, 26, 26, 26, 25, 25, 24, 24, 23, 23, 22, 21, 21,
20, 20, 19, 19, 18, 18, 17, 17, 16, 15, 15, 14, 14, 13, 13, 12, 12, 11, 11, 10, 10, 10, 9,9,8, 8,8,7,
7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, ...,0.
[0073] Likewise, for qp = 51 (the highest qp) we have $LUT(51, \Delta I)$ = 31, 31, 31, 31, 31,
31,31,31,31,31,31, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 29, 29, 29, 29, 29, 29, 29, 28, 28, 28, 28, 28, 27, 27, 27, 27, 27, 26, 26, 26, 26, 26, 25, 25, 25, 25, 24, 24, 24, 24, 23, 23, 23, 23, 22, 22,
22, 21, 21, 21, 21, 20, 20, 20, 20, 19, 19, 19, 18, 18, 18, 18, 17, 17, 17, 17, 16, 16, 16, 15, 15, 15,
15, 14, 14, 14, 14, 13, 13, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 10, 9,9,9, 9,9,8,
8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1,1, 1,1, 1,1, 1,0, 0, ..., 0.
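The rows listed above can be regenerated with a short script. The sketch below assumes that Equation 24 has the form round(65 · e^(−1/(2·0.82²)) · e^(−ΔI²/(2·σr²))) with σr = 2·(qp − 17) for 10-bit data; this assumption reproduces the qp = 18 row given in the text. The function names are hypothetical, and each row is truncated after its first zero as described for [1].

```python
import math

SIGMA_D = 0.82  # fixed spatial parameter used for the table
ONE = 65        # integer that represents 1.0

def sigma_r_10bit(qp):
    # Assumed mapping from qp to sigma_r for 10-bit data
    return 2 * (qp - 17)

def lut_row(qp, max_delta=1023):
    """LUT(qp, dI) for dI = 0, 1, 2, ..., truncated after the first zero."""
    sigma_r = sigma_r_10bit(qp)
    spatial = math.exp(-1.0 / (2.0 * SIGMA_D ** 2))
    row = []
    for delta in range(max_delta + 1):
        value = round(ONE * spatial
                      * math.exp(-(delta * delta) / (2.0 * sigma_r ** 2)))
        row.append(value)
        if value == 0:  # values only decrease, so the rest are zero too
            break
    return row

if __name__ == "__main__":
    print(lut_row(18))
    for qp in (34, 51):
        print(qp, len(lut_row(qp)))
    print("total stored values:", sum(len(lut_row(qp)) for qp in range(18, 52)))
```

Under these assumptions the rows for qp = 34 and qp = 51 contain 99 and 197 stored values respectively, and the 34 rows together contain 3,468 values.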
[0074] Plotting these into the same diagram, with $\Delta I$ as the x-value and the weight as the y-value, we get the plot illustrated in FIG. 1.
[0075] As shown in FIG. 1, it looks like the curve for qp = 51 is just a stretched version of that for qp = 34. This turns out to be exactly the case. The formula for the curve for qp = 34 is, according to Equation 24 (here we have omitted the rounding for convenience),
$$LUT(34, \Delta I) = 65 \, e^{-\frac{1}{2 * 0.82^2}} e^{-\frac{|\Delta I|^2}{2 * 2^2 * (34 - 17)^2}}.$$
But $\frac{|\Delta I|^2}{2 * 2^2 * (34 - 17)^2}$ can, if we multiply both numerator and denominator by $\frac{(51 - 17)^2}{(34 - 17)^2}$, be written as
$$\frac{|\Delta I|^2 \frac{(51 - 17)^2}{(34 - 17)^2}}{2 * 2^2 * (34 - 17)^2 \frac{(51 - 17)^2}{(34 - 17)^2}} = \frac{\left|\Delta I * \frac{51 - 17}{34 - 17}\right|^2}{2 * 2^2 * (51 - 17)^2} = \frac{|\Delta I * c|^2}{2 * 2^2 * (51 - 17)^2}$$
(Eqn 25)
where $c = \frac{51 - 17}{34 - 17} = 2$. Hence we see that
$$LUT(34, \Delta I) = LUT(51, c * \Delta I) = LUT(51, 2\Delta I).$$
[0076] Thus, instead of using LUT(34, ΔI), we can get the same result by taking every second value in LUT(51, ΔI). (It is every second value, since c happens to be exactly 2 in this case.)
[0077] This means that it is sufficient to store a one-dimensional LUT, for instance LUT(51, ΔI). If we are interested in another value for qp, such as qp = 34, instead of storing another 1D row LUT(34, ΔI), we simply reuse the one for qp = 51: LUT(34, ΔI) = LUT(51, 2ΔI). This means that, instead of storing 3,468 values as in [1], we only need to store 197 values (the number of values in LUT(51, ΔI)). This is a reduction by a factor of more than 17.
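The row reuse can be checked numerically. The sketch below assumes the same weight formula as above (peak weight 31, exponent denominator 2·2²·(qp − 17)²) and only handles qp values for which c is an integer; all names are hypothetical.

```python
import math

def lut_value(qp, delta_i):
    """Assumed weight formula, rounded to the nearest integer."""
    return round(31 * math.exp(-(delta_i ** 2) / (2 * 2**2 * (qp - 17) ** 2)))

BASE_QP = 51
BASE_ROW = [lut_value(BASE_QP, d) for d in range(1024)]  # the single stored row

def lut_from_base(qp, delta_i):
    """Read LUT(qp, delta_i) from the qp = 51 row when c is an integer."""
    c, rem = divmod(BASE_QP - 17, qp - 17)
    if rem != 0:
        raise ValueError("c is not an integer for this qp")
    # Indices past the end of the base row land on its last (zero) entry.
    return BASE_ROW[min(c * abs(delta_i), len(BASE_ROW) - 1)]
```

For qp = 34 the divisor gives c = 2, so `lut_from_base(34, d)` reads every second entry of the base row.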
[0078] However, we must also handle the c-values. They will not always be as neat as in the above example, where c = 2. In theory the c-value can be calculated on the fly using
$$c = \frac{51-17}{qp-17}, \quad \text{(Eqn 26)}$$
but having a division inside the filtering should be avoided. Instead we could store the c-values using, for instance, 8 bits: four bits for the integer part and four bits for the fractional part. This means that we would need to store 34 8-bit values for the c-values, and 197 values for LUT(51, ΔI). It is even possible to save further by using another table as the base table.
As an example, if qp=34 is the reference table (or base table), then only 99 values are needed. In total such an implementation would need 34 + 99*5/8 = 96 bytes of data compared to 2,202 bytes, a reduction by a factor of more than 20. This second aspect can be used even without reuse of the rows; as an example, it is possible to store one LUT row for every qp, but use every value twice in order to reduce the size of the LUT row. As an example, the LUT row used for qp 51 could be:
LUT*(51, ΔI) = 31, 31, 31, 31, 31, 31, 30, 30, 30, 30, 30, 29, 29, 29, 28, 28, 28, 27, 27, 26, 26, 26, 25, 25, 24, 24, 23, 23, 22, 21, 21, 20, 20, 19, 19, 18, 18, 17, 17, 16, 15, 15, 14, 14, 13, 13, 12, 12, 11, 11, 10, 10, 10, 9, 9, 8, 8, 8, 7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ..., 0.
[0079] This is equivalent to LUT(34, ΔI) above. However, we would use a c-value of
$$c = \frac{34-17}{51-17} = \frac{1}{2}$$
so that the LUT row would fit qp = 51 rather than qp = 34. Here 34 is the first term in the numerator since the table row in its original form was created to work for qp = 34, the source qp. Also, 51 is the first term in the denominator since this is the qp that we want to use the LUT row for, the target qp. As can be seen, LUT*(51, ΔI) (marked with a star) has only 98 non-zero values, which is half the number in LUT(51, ΔI), which has 196 non-zero values. If this is done not only for qp 51 but for all qps, it is possible to reduce the total number of LUT values by half without even reusing LUT rows. In another embodiment the scaling can be used to set a maximum for the number of elements in a LUT row. Many decoders and encoders are implemented in software, and it is well known that SIMD operations are often used to speed up execution of such software. There are efficient SIMD operations for table look-ups, but they typically restrict the number of entries that can be used in the LUT. A common way to implement a LUT in SIMD is to put the entire LUT in a SIMD register, which can be 128 bits wide. If every entry is eight bits, there is only room for 16 entries in the LUT. When decoding or encoding a block, only one LUT row at a time needs to be used, since a block is restricted to a single qp. However, this still imposes the restriction that a LUT row cannot have more than 16 non-zero items. There are work-arounds for using more than 16 entries. As an example, two LUT operations with two different registers can be used to obtain the equivalent of a look-up from 32 elements. However, this costs one valuable register and one extra instruction, making it harder to make the code run quickly. Hence, for maximum throughput, a maximum of 16 entries should be used per row.
This can be done by selecting a c-value for every qp value so that the resulting table has at most 16 non-zero values (or, alternatively, 15 non-zero values).
[0080] One downside of an embodiment using a c-value is that the act of fetching a look-up table value suddenly seems to have become more complex, since it involves a multiplication. As an example, this is the previous pseudo code for fetching the weight:
// done once per block:
lookupTablePtr = m_bilateralFilterTable[qp-18]; // point to the right LUT, e.g., LUT[x,34]
theMaxPos = maxPosList[qp-18]; // find where the zeros start
// done several times per pixel:
weight = lookupTablePtr[min(theMaxPos, abs(deltaI))];
[0081] Now however, the idea illustrated above instead gives the following pseudo code:
// constant variables since they never change
lookupTablePtr = m_singleTable; // always points to LUT[x,34]
theMaxPos = 99; // the place where the zeros start
// done once per block
c_value = m_c_valueTable[qp-18]; // get the correct c-value (4 fractional bits)
// done several times per pixel:
weight = lookupTablePtr[min(theMaxPos, (abs(deltaI) * c_value) >> 4)];
[0082] For a software implementation, this increase in complexity is only superficial. A LUT of 2,202 bytes is typically not very big in software, so it is possible to go back to a 2D table using the following code at initialization time (i.e., only once when starting the software):
for (qp = 18; qp < 52; qp++) {
    c_value = m_c_valueTable[qp-18];
    for (dI = 0; dI < maxPosList[qp-18] + 1; dI++)
        m_bilateralFilterTable[qp-18][dI] = m_singleTable[min(theMaxPos, (abs(dI) * c_value) >> 4)];
}
[0083] Then the regular 2D-LUT software (first example) can be used instead. So embodiments carry no penalty for a software implementation, but can save significantly in terms of LUT space for hardware implementations.
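The initialization-time expansion can be sketched as follows. The weight formula (peak weight 31, exponent denominator 2·2²·(qp − 17)²), the choice of qp = 51 as base row, and all names are assumptions for illustration. Python integers do not overflow, so the sketch does not model the 8-bit storage of the c-values.

```python
import math

FRAC_BITS = 4  # four fractional bits for the c-value, as described above

def lut_value(qp, delta_i):
    """Assumed weight formula, rounded to the nearest integer."""
    return round(31 * math.exp(-(delta_i ** 2) / (2 * 2**2 * (qp - 17) ** 2)))

def build_base_row(base_qp):
    """Non-zero part of the base row followed by the first zero."""
    row = []
    d = 0
    while True:
        v = lut_value(base_qp, d)
        row.append(v)
        if v == 0:
            return row
        d += 1

def expand_tables(base_qp=51, qps=range(18, 52)):
    """Expand the single stored row into one full row per qp."""
    base = build_base_row(base_qp)
    max_pos = len(base) - 1  # index of the first zero
    tables = {}
    for qp in qps:
        # c in fixed point: (base_qp - 17) / (qp - 17) scaled by 2**FRAC_BITS
        c_fixed = round((base_qp - 17) * (1 << FRAC_BITS) / (qp - 17))
        tables[qp] = [base[min(max_pos, (d * c_fixed) >> FRAC_BITS)]
                      for d in range(1024)]
    return tables
```

For qp values whose c is exactly representable (for example qp = 34 and qp = 51) the expanded row equals the directly computed row; for other qp values it is an approximation whose quality depends on the number of fractional bits.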
[0084] This saved LUT space must be traded off against the accuracy of the c-values. In the case above, we use an 8-bit value for the c-value, and the largest ΔI value of interest (for the highest qp of 51) is 197, which also fits in 8 bits. Thus the multiplication can be handled using an 8-bit by 8-bit multiplier. This multiplier will also consume surface area. If we had chosen a significantly higher accuracy for the c-value, the area consumed by the multiplier could be larger than the area saved by reducing the number of LUT values. It is therefore important that the accuracy of the c-value is set sufficiently low that a substantial reduction results.
[0085] In another embodiment of the third aspect, we avoid the multiplication by putting some restrictions on the c-values. To understand how to do this efficiently, we look at the optimal c-values for qp 18 through 51 for the case when LUT(33, ΔI) is the base table:
c18 = 16.000000
c19 = 8.000000
c20 = 5.333333
c21 = 4.000000
c22 = 3.200000
c23 = 2.666667
c24 = 2.285714
c25 = 2.000000
c26 = 1.777778
c27 = 1.600000
c28 = 1.454545
c29 = 1.333333
c30 = 1.230769
c31 = 1.142857
c32 = 1.066667
c33 = 1.000000
c34 = 0.941176
c35 = 0.888889
c36 = 0.842105
c37 = 0.800000
c38 = 0.761905
c39 = 0.727273
c40 = 0.695652
c41 = 0.666667
c42 = 0.640000
c43 = 0.615385
c44 = 0.592593
c45 = 0.571429
c46 = 0.551724
c47 = 0.533333
c48 = 0.516129
c49 = 0.500000
c50 = 0.484848
c51 = 0.470588
[0086] As can be seen in the table, quite a few of these (c18, c19, c21, c25, c33 and c49) are pure powers of two. This means that if we restrict the c-value to be a power of two, these will still be well represented. Furthermore, in the beginning of the list, even the ones that are not powers of two are often immediately next to one that is a pure power of two. As an example, the c-value for qp 20 is 5.333, which is not a pure power of two, but in this case it is likely acceptable to use the c-value for qp 21 (which is 4) or the c-value for qp 19 (which is 8) instead. The reason why this can work is that the mapping from qp value to σ_r is not derived from some strict principle, but is rather something that seems to work reasonably well. If something works reasonably well for qp 21, it may also work reasonably well for qp 20.
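A short sketch can list which qp values have a c-value that is an exact power of two. It assumes the c-value is (base_qp − 17)/(qp − 17), consistent with the table above for base qp 33; the function names are hypothetical.

```python
from fractions import Fraction

def c_value(qp, base_qp=33):
    """Exact scaling factor from the base row to the row for qp."""
    return Fraction(base_qp - 17, qp - 17)

def is_power_of_two(x):
    """True for ..., 1/4, 1/2, 1, 2, 4, ... given a positive Fraction."""
    n, d = x.numerator, x.denominator
    if n != 1 and d != 1:
        return False
    m = max(n, d)
    return (m & (m - 1)) == 0

def exact_shift_qps(base_qp=33, qps=range(18, 52)):
    """qp values whose c-value can be applied as a pure bit shift."""
    return [qp for qp in qps if is_power_of_two(c_value(qp, base_qp))]
```

With base qp 33 this yields qp 18, 19, 21, 25, 33 and 49, i.e. the entries 16, 8, 4, 2, 1 and 0.5 in the list above.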
[0087] However, the lower half of the table is quite sparsely populated by c-values that are pure powers of two. Restricting the c-values to powers of two is therefore likely to hurt the performance somewhat.
However, this can be mitigated by having not one, but two or more base tables. As an example, there is no power-of-two c-value between 34 and 47, but using a second base table at qp = 41 would cut this long stretch in half. It would also cut the second-longest stretch, from 26 to 32, in half.
[0088] As an example, we have tested using two base tables, LUT(32, x) and LUT(39, x), and c-values that are only powers of two. This gave a result where no significant BD-rate degradation could be measured for the filter. If this is done, the c-values need to be stored differently, since they will no longer represent a multiplication but instead a bit shift. In the example above, bit shifts from -1 (division by two) to +4 (multiplication by 16) are used. This can be stored in 3 bits. If we have two base tables we also need to store an index to tell which base table to use for a given qp. We call this the base_index. When we have only two base tables, the base_index for each qp only needs one bit. Thus 4 bits of information need to be stored per qp, in total 4*34/8 = 17 bytes of information. Also, LUT(32, x) consists of 88 5-bit values, and LUT(39, x) consists of 128 5-bit values, giving another (88+128)*5/8 = 135 bytes. In total 152 bytes need to be stored compared to 2202 bytes in [1], a reduction by a factor of over 14.
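The storage arithmetic in this paragraph can be written out as a small sketch. The function name and parameter names are hypothetical; the numbers are the ones stated above.

```python
def storage_bytes(row_lengths, bits_per_entry=5, per_qp_bits=4, qp_count=34):
    """Return (side-information bytes, table bytes, total bytes)."""
    side_info = per_qp_bits * qp_count / 8      # 3-bit shift + 1-bit base_index per qp
    tables = sum(row_lengths) * bits_per_entry / 8
    return side_info, tables, side_info + tables
```

Calling `storage_bytes([88, 128])` reproduces the 17, 135 and 152 bytes quoted in the text.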
[0089] The pseudo-code to do the lookup would look something like this:
// constant variables since they never change
tablePtr[0] = m_baseTable32; // always points to LUT[x,32]
tablePtr[1] = m_baseTable39; // always points to LUT[x,39]
maxPos[2] = {88, 128}; // the place where the zeros start
// done once per block
c_shift = m_c_valueTable[qp-18]; // get the correct c-value (a negative value denotes a right shift)
lookupTablePtr = tablePtr[base_index[qp-18]]; // get the correct base table
theMaxPos = maxPos[base_index[qp-18]]; // get the correct max-pos
// done several times per pixel:
weight = lookupTablePtr[min(theMaxPos, (abs(deltaI) << c_shift))];
[0090] Again, it should be noted that one does not need to implement it this way; just as before it is possible to expand this to one 2D-table and use the simpler code to access it.
However, for a hardware implementation, this is very inexpensive to implement. In another embodiment we use four tables, for instance the tables for 50, 46, 42, and 38.
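A shift-based look-up can be sketched as below. The weight formula (peak weight 31, exponent denominator 2·2²·(qp − 17)²) and all names are assumptions. The per-qp shift and base_index tables used in the tested configuration are not given in the text, so the sketch takes the shift as a parameter and treats a negative shift as a right shift.

```python
import math

def lut_value(qp, delta_i):
    """Assumed weight formula, rounded to the nearest integer."""
    return round(31 * math.exp(-(delta_i ** 2) / (2 * 2**2 * (qp - 17) ** 2)))

def first_zero(qp):
    """Smallest non-negative difference whose weight rounds to zero."""
    d = 0
    while lut_value(qp, d) != 0:
        d += 1
    return d

def make_base(qp):
    """Base row up to and including the first zero, plus that position."""
    max_pos = first_zero(qp)
    return [lut_value(qp, d) for d in range(max_pos + 1)], max_pos

def shifted_lookup(base_row, max_pos, delta_i, c_shift):
    """c_shift > 0 scales |delta_i| up by 2**c_shift; c_shift < 0 scales it down."""
    a = abs(delta_i)
    idx = a << c_shift if c_shift >= 0 else a >> -c_shift
    return base_row[min(max_pos, idx)]
```

With a base row for qp 33, a shift of +1 reproduces the row for qp 25 exactly, while a shift of -1 reproduces the row for qp 49 exactly only for even differences, because the right shift discards the lowest bit.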
[0091] In an alternative embodiment, one may be interested in reducing the maximum size of a LUT row to a fixed number, such as 16, in order to facilitate efficient SIMD implementation. This can also be achieved using just shifts, since it is possible to use a smaller table and then do a right shift to simulate c-values smaller than one. In this case, it is possible to use the following LUT rows, none of which has more than 16 entries:
LUT*(18, ΔI) = {255, 225, 155, 83, 35, 11, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0};
LUT*(19, ΔI) = {255, 247, 225, 192, 155, 117, 83, 55, 35, 20, 11, 6, 3, 1, 1, 0};
LUT*(20, ΔI) = {255, 246, 215, 167, 117, 73, 41, 21, 9, 4, 1, 0, 0, 0, 0, 0};
LUT*(21, ΔI) = {255, 250, 231, 201, 164, 126, 91, 62, 39, 23, 13, 7, 3, 2, 1, 0};
LUT*(22, ΔI) = {254, 239, 192, 132, 77, 39, 17, 6, 2, 1, 0, 0, 0, 0, 0, 0};
LUT*(23, ΔI) = {255, 243, 209, 161, 111, 69, 38, 19, 8, 3, 1, 0, 0, 0, 0, 0};
LUT*(24, ΔI) = {255, 246, 220, 182, 138, 97, 63, 37, 21, 10, 5, 2, 1, 0, 0, 0};
LUT*(25, ΔI) = {255, 248, 228, 197, 159, 121, 87, 58, 37, 22, 12, 6, 3, 1, 1, 0};
LUT*(26, ΔI) = {254, 232, 176, 109, 56, 24, 8, 2, 1, 0, 0, 0, 0, 0, 0, 0};
LUT*(27, ΔI) = {254, 236, 188, 128, 74, 37, 16, 6, 2, 1, 0, 0, 0, 0, 0, 0};
LUT*(28, ΔI) = {254, 239, 198, 144, 92, 51, 25, 11, 4, 1, 1, 0, 0, 0, 0, 0};
LUT*(29, ΔI) = {254, 242, 206, 158, 108, 66, 36, 18, 8, 3, 1, 1, 0, 0, 0, 0};
LUT*(30, ΔI) = {254, 244, 213, 169, 123, 81, 48, 26, 13, 6, 2, 1, 0, 0, 0, 0};
LUT*(31, ΔI) = {254, 245, 218, 179, 136, 95, 61, 36, 20, 10, 5, 2, 1, 0, 0, 0};
LUT*(32, ΔI) = {255, 246, 223, 187, 147, 107, 73, 46, 27, 15, 8, 4, 2, 1, 0, 0};
LUT*(33, ΔI) = {255, 247, 226, 194, 157, 119, 85, 57, 36, 21, 12, 6, 3, 1, 1, 0};
LUT*(34, ΔI) = {255, 248, 229, 201, 166, 130, 96, 68, 45, 28, 17, 9, 5, 3, 1, 1};
LUT*(35, ΔI) = {253, 231, 174, 107, 55, 23, 8, 2, 1, 0, 0, 0, 0, 0, 0, 0};
LUT*(36, ΔI) = {253, 233, 180, 117, 64, 29, 11, 4, 1, 0, 0, 0, 0, 0, 0, 0};
LUT*(37, ΔI) = {254, 235, 187, 126, 73, 36, 15, 5, 2, 1, 0, 0, 0, 0, 0, 0};
LUT*(38, ΔI) = {254, 237, 192, 135, 82, 43, 20, 8, 3, 1, 0, 0, 0, 0, 0, 0};
LUT*(39, ΔI) = {254, 239, 197, 143, 91, 50, 25, 11, 4, 1, 1, 0, 0, 0, 0, 0};
LUT*(40, ΔI) = {254, 240, 201, 150, 99, 58, 30, 14, 6, 2, 1, 0, 0, 0, 0, 0};
LUT*(41, ΔI) = {254, 241, 205, 156, 107, 65, 36, 18, 8, 3, 1, 1, 0, 0, 0, 0};
LUT*(42, ΔI) = {254, 242, 209, 162, 114, 73, 42, 22, 10, 4, 2, 1, 0, 0, 0, 0};
LUT*(43, ΔI) = {254, 243, 212, 168, 121, 80, 48, 26, 13, 6, 2, 1, 0, 0, 0, 0};
LUT*(44, ΔI) = {254, 244, 215, 173, 128, 87, 54, 31, 16, 8, 3, 1, 1, 0, 0, 0};
LUT*(45, ΔI) = {254, 245, 217, 178, 134, 93, 60, 35, 19, 10, 5, 2, 1, 0, 0, 0};
LUT*(46, ΔI) = {254, 245, 220, 182, 140, 100, 66, 41, 23, 12, 6, 3, 1, 1, 0, 0};
LUT*(47, ΔI) = {254, 246, 222, 186, 146, 106, 72, 46, 27, 15, 8, 4, 2, 1, 0, 0};
LUT*(48, ΔI) = {254, 247, 224, 190, 151, 112, 78, 51, 31, 18, 9, 5, 2, 1, 1, 0};
LUT*(49, ΔI) = {254, 247, 225, 193, 156, 118, 84, 56, 35, 21, 12, 6, 3, 1, 1, 0};
LUT*(50, ΔI) = {254, 247, 227, 197, 160, 124, 90, 61, 40, 24, 14, 8, 4, 2, 1, 1};
LUT*(51, ΔI) = {255, 248, 229, 200, 165, 129, 95, 67, 44, 28, 16, 9, 5, 2, 1, 1};
[0092] Now, if we want to calculate the LUT row for, say, qp = 51 from these compressed LUT rows, we can do so through the equation
$$LUT(51, \Delta I) = LUT^*\big(51, (\Delta I + \mathrm{half\_val}(51)) \gg \mathrm{shift\_val}(51)\big)$$
where half_val(51) = 8 and shift_val(51) = 4. This will give the same result as using the following LUT row:
LUT(51, ΔI) = {255, 255, 255, 255, 255, 255, 255, 255, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, 229, 229, 229, 229, 229, 229, 229, 229, 229, 229,
229, 229, 229, 229, 229, 229, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200,
200, 200, 200, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165, 165,
129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 129, 95, 95, 95, 95,
95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 95, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67,
67, 67, 67, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 28, 28, 28, 28, 28, 28,
28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,
16, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2,2,2, 2, 2,2,2, 2, 2, 2,2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,0,0, 0, 0,0, 0, 0, 0,0,0, 0, 0, };
[0093] As can be seen, this function is step-wise constant. When multiplying it by the ΔI value, the result will be a function that is piecewise increasing (since ΔI is increasing while the LUT value is constant), with discontinuities where the LUT value goes down.
In order to avoid this, it is possible to make sure that the value we multiply the LUT value with is constant over the same intervals. This can be done by multiplying by
Figure imgf000029_0001
instead of multiplying by ΔI. Hence, we can use
Figure imgf000029_0002
instead of LUT(51, ΔI) * ΔI.
[0094] This example showed how to obtain the LUT row for qp = 51. To get the LUT rows for other qps, we could use the following values for half_val and shift_val:
half_val = {0, 0, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8};
shift_val = {0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4};
Note that, for the first two qps, 18 and 19, we should not add anything (half_val = 0) and we should not shift any steps (shift_val = 0). Also, as can be seen, the last value of half_val, representing qp = 51, is 8, just as in the example above, and the last value of shift_val, representing qp = 51, is 4, also in alignment with the example above.
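The expansion of a compressed 16-entry row can be sketched as follows, using the row for qp = 51 and the half_val and shift_val tables listed above. The index expression (|ΔI| + half_val) >> shift_val is inferred from the expanded row shown for qp = 51, and returning zero beyond the stored entries is an assumption; the function name is hypothetical.

```python
LUT_STAR_51 = [255, 248, 229, 200, 165, 129, 95, 67, 44, 28, 16, 9, 5, 2, 1, 1]

# One entry per qp from 18 to 51, indexed by qp - 18.
HALF_VAL = [0, 0, 1, 1, 2, 2, 2, 2] + [4] * 9 + [8] * 17
SHIFT_VAL = [0, 0, 1, 1, 2, 2, 2, 2] + [3] * 9 + [4] * 17

def expand_lookup(row16, delta_i, half_val, shift_val):
    """Weight for delta_i read from a compressed row."""
    idx = (abs(delta_i) + half_val) >> shift_val
    # Assumption: differences past the stored entries get weight zero.
    return row16[idx] if idx < len(row16) else 0
```

For qp = 51 the first 8 differences map to entry 0 and each following group of 16 differences maps to the next entry, which matches the step widths of the expanded row.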
[0095] In yet an alternative embodiment, it is possible to simplify the above by using LUT(51, ΔI) * ΔI and simply accept the fact that the resulting function is non-monotonic. In another embodiment, it may be possible to simplify the equations by setting half_val to zero for all qps; then the extraction of a value becomes
LUT(qp, ΔI) = LUT*(qp, ΔI >> shift_val(qp))
and the multiplication with ΔI can be, for example,
LUT(qp, ΔI) * [(ΔI >> shift_val(qp)) << shift_val(qp)]
if a monotonic function is needed, or LUT(51, ΔI) * ΔI if a non-monotonic function is acceptable.
[0096] So far we have looked at this third aspect through the lens of the implementation of [1]. However, it is easy to see that this third aspect is also applicable to the filter as defined in [2] when we have applied the first aspect (going from a 3D LUT to a 2D LUT) but not the second (avoiding multiplication). As is described in conjunction with Equation 22, we then store r_{σ_r}(ΔI) in the LUT. Expressing this in terms of qp gives
Figure imgf000030_0001
and it is easy to see that this is just the same as the expression in Equation 24, modulo a scaling factor of 65 * e2*o. 2 . Hence the same math applies as to how to approximate one table row with the help of another using a c-value.
[0097] It is perhaps a bit trickier to see that this third aspect is also applicable when both the first and the second aspects are applied to [2]. In that case, we store the influence factor m_{σ_r}(ΔI) as described in Equation 11. Expressing the LUT in terms of qp instead of σ_r gives
Figure imgf000030_0002
[0098] Plotting the rows for this LUT for all 34 different qp’s gives the plot illustrated by FIG. 2. From this plot, it is clear that only scaling along the x-axis will not work, since the functions have different heights.
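[0099] However, the trick of multiplying both the numerator and denominator in the exponent by $\frac{(51-17)^2}{(34-17)^2}$, shown in Equation 25 (repeated for convenience), works here as well:
$$\frac{|\Delta I|^2 \, \frac{(51-17)^2}{(34-17)^2}}{2 \cdot 2^2 \cdot (34-17)^2 \, \frac{(51-17)^2}{(34-17)^2}} = \frac{\left|\Delta I \cdot \frac{51-17}{34-17}\right|^2}{2 \cdot 2^2 \cdot (51-17)^2} = \frac{|\Delta I \cdot c|^2}{2 \cdot 2^2 \cdot (51-17)^2}. \quad \text{(Eqn 25)}$$
Thus
$$LUT(34, \Delta I) = \Delta I \cdot e^{-\frac{|\Delta I|^2}{2 \cdot 2^2 \cdot (34-17)^2}} = \Delta I \cdot e^{-\frac{|\Delta I \cdot c|^2}{2 \cdot 2^2 \cdot (51-17)^2}}. \quad \text{(Eqn 29)}$$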
However, in this case, the right-hand side does not equal LUT(51, ΔI * c), because that is equal to
$$LUT(51, \Delta I \cdot c) = \Delta I \cdot c \cdot e^{-\frac{|\Delta I \cdot c|^2}{2 \cdot 2^2 \cdot (51-17)^2}}.$$
Instead, we have that
$$LUT(34, \Delta I) = \frac{1}{c} \, LUT(51, \Delta I \cdot c).$$
[00100] Thus, instead of scaling only in the x-direction, we need to scale in both the x- and y-directions, with c and $\frac{1}{c}$ respectively. If we use an arbitrary number c, we will also need to store $\frac{1}{c}$ so that we can multiply the result by that factor afterwards. The source code could look something like this:
// constant variables since they never change
lookupTablePtr = m_singleTable; // always points to LUT[x,51]
theMaxPos = 99; // the place where the zeros start
// done once per block
c_value = m_c_valueTable[qp-18]; // get the correct c-value
inv_c_value = m_inv_c_valueTable[qp-18]; // get the correct 1/c value
// done several times per pixel:
weight = (inv_c_value * lookupTablePtr[min(theMaxPos, (abs(deltaI) * c_value) >> 4)]) >> 4;
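The two-direction scaling can be checked numerically with unrounded values. The sketch assumes the influence function ΔI · exp(−|ΔI|² / (2·2²·(qp − 17)²)) from Equation 29, without any further scaling factor or rounding; the function names are hypothetical.

```python
import math

def influence(qp, delta_i):
    """Unrounded influence function: delta_i times the exponential weight."""
    return delta_i * math.exp(-(delta_i ** 2) / (2 * 2**2 * (qp - 17) ** 2))

def influence_from_base(qp, delta_i, base_qp=51):
    """Scale x by c and y by 1/c to reuse the base row."""
    c = (base_qp - 17) / (qp - 17)
    return influence(base_qp, c * delta_i) / c
```

Scaling only the argument would give values that are c times too large, which is why the division by c (or the multiplication by a stored 1/c) is needed.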
[00101] Note that for a software implementation this could again be efficiently implemented by expanding the result back into a table. For hardware, though, this saves valuable LUT space. A drawback of this solution, however, is that it incurs an extra multiplication again.
[00102] Luckily, dividing by c is easy if c is a power of 2; if c = 2^k, dividing by c is roughly equivalent to just shifting k bits. Hence in one embodiment of the third aspect, only powers of two are used, and the following pseudo code can be used to calculate the weight:
// constant variables since they never change
tablePtr[0] = m_baseTable32; // always points to LUT[x,32]
tablePtr[1] = m_baseTable39; // always points to LUT[x,39]
maxPos[2] = {88, 128}; // the place where the zeros start
// done once per block
c_shift = m_c_valueTable[qp-18]; // get the correct c-value
lookupTablePtr = tablePtr[base_index[qp-18]]; // get the correct base table
theMaxPos = maxPos[base_index[qp-18]]; // get the correct max-pos
// done several times per pixel:
weight = lookupTablePtr[min(theMaxPos, (abs(deltaI) << c_shift))] >> c_shift;
[00103] The pseudo code above uses two table rows as base tables (32 and 39), but it is possible to use any number of rows as base table rows, including using four rows (e.g., 50, 46, 42, and 38) as base rows. Again, a software version can be much more efficiently rewritten. For a hardware implementation though this is valuable. This combines all three aspects; reducing the size of the LUT by reducing it to two dimensions, avoiding a multiplication, and finally reusing the rows of the LUT by scaling.
[00104] It should be noted that when tabulating the influence function
Figure imgf000032_0001
it is no longer true that LUT(qp, −ΔI) = LUT(qp, ΔI). This is because of the multiplication by ΔI, which changes the sign. However, it is instead true that LUT(qp, −ΔI) = −LUT(qp, ΔI), so it is still possible to tabulate only for positive values of ΔI. For negative values of ΔI we simply first negate, then fetch the value from the LUT, and then negate the fetched value.
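The negate, fetch, negate procedure can be sketched as below. The tabulated function (ΔI times the assumed exponential weight, rounded, with no further scaling) and the names are illustrative assumptions.

```python
import math

def influence_row(qp, max_delta=1023):
    """Tabulate the influence function for non-negative differences only."""
    return [round(d * math.exp(-(d ** 2) / (2 * 2**2 * (qp - 17) ** 2)))
            for d in range(max_delta + 1)]

def influence_lookup(row, delta_i):
    """Odd symmetry: negate the argument, fetch, then negate the result."""
    if delta_i < 0:
        return -row[-delta_i]
    return row[delta_i]
```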
[00105] LUT Instantiations
[00106] Here we go deeper into how many instantiations of the LUT we need in order to filter a single pixel. As an example, consider the filtering from [1]. It filters a pixel using
$$I_F = I_C + \frac{w_A \Delta I_A + w_B \Delta I_B + w_L \Delta I_L + w_R \Delta I_R}{1 + w_A + w_B + w_L + w_R} \quad \text{(Eqn 1)}$$
where I_F is the filtered pixel intensity and I_C is the center pixel intensity, i.e., the pixel intensity before filtering. The value ΔI_A is the difference between the center pixel intensity I_C and the intensity of the pixel above, I_A: ΔI_A = I_A − I_C. Analogously, I_B, I_L and I_R are the intensities of the pixels immediately below, left and right of the center pixel respectively, and
Figure imgf000033_0001
[00107] In this case, the weights w_A, w_B, w_L and w_R are potentially all different, and they depend on the intensity differences ΔI_A, ΔI_B, ΔI_L and ΔI_R. As an example, w_A is fetched from the LUT using w_A = LUT(qp, ΔI_A). However, w_B is fetched from a different part of the LUT using w_B = LUT(qp, ΔI_B). Typically, a LUT implementation cannot read out two different values at once. Hence in this case we would need two LUT instantiations in order to be able to read out w_A and w_B in parallel. For the same reason, to get all weights we would need four LUT instantiations.
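Equation 1 can be sketched in floating point as below. The weight function is an assumption (the unquantised exponential with denominator 2·2²·(qp − 17)², without any spatial factor), and the function names are hypothetical; the point is only that each pixel needs four independent weight look-ups.

```python
import math

def weight(qp, delta_i):
    """Assumed unquantised range weight in [0, 1]."""
    return math.exp(-(delta_i ** 2) / (2 * 2**2 * (qp - 17) ** 2))

def filter_pixel(i_c, i_a, i_b, i_l, i_r, qp):
    """Filtered intensity of the center pixel from its four neighbors."""
    deltas = [i_a - i_c, i_b - i_c, i_l - i_c, i_r - i_c]
    weights = [weight(qp, d) for d in deltas]  # four look-ups per pixel
    numerator = sum(w * d for w, d in zip(weights, deltas))
    return i_c + numerator / (1 + sum(weights))
```

A flat neighborhood leaves the pixel unchanged, and neighbors that are all brighter pull the result towards them without exceeding their value.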
[00108] In the following we describe how it is possible to get away with using only seven instantiations of a LUT when filtering a row or a column of four pixels in parallel.
[00109] Assume we need to filter the right-most column in the following 4x4 block.
Figure imgf000033_0002
Here we have denoted the top right pixel with “T”, the pixel to its left with “S”, etc. The intensity of pixel T is denoted I_T, the intensity of pixel S is denoted I_S, etc.
[00110] To filter pixel T we first calculate the intensity differences between T and its neighboring pixels S and V: ΔI_TS = I_S − I_T and ΔI_TV = I_V − I_T. We can now use |ΔI_TS| to get the weight w_TS = LUT(qp, |ΔI_TS|), and likewise for the weight w_TV = LUT(qp, |ΔI_TV|). The filtered pixel T can now be calculated using
Figure imgf000034_0001
where wc is a center weight that is constant for the block and therefore does not need to be looked up.
[00111] Next we filter pixel V. To do so, we need to calculate the weights for the three surrounding pixels T, U and X. We start with the pixel above, ΔI_VT = I_T − I_V, and we can now calculate the weight w_VT = LUT(qp, |ΔI_VT|). However, since ΔI_VT = I_T − I_V = −(I_V − I_T) = −ΔI_TV, this means that |ΔI_VT| = |ΔI_TV|, and hence
w_VT = LUT(qp, |ΔI_VT|) = LUT(qp, |ΔI_TV|) = w_TV,
which we already looked up above. Hence we can reuse w_TV and we don’t need another instantiation of the LUT. For the other two weights w_VU and w_VX we need instantiations of the LUT, bringing the number to four so far.
[00112] Likewise, when filtering pixel X, we can use wxv = wvx and only need two more instantiations for wxw and wxz, bringing the total so far to six.
[00113] For the last pixel, we can use wzx = wxz and we only need one more instantiation for wZY , bringing the total number of LUT instantiations needed to 7.
[00114] It should also be noted that the product w_VT ΔI_VT = w_TV (−ΔI_TV) = −w_TV ΔI_TV. Hence it is possible to save not only the look-up of w_VT but also the multiplication w_VT ΔI_VT, replacing it with a negation of the previously calculated value w_TV ΔI_TV. This means that to filter the four pixels T, V, X and Z, a total of seven multipliers must be used if using the method described in [2]. If only one multiplication is needed per filtered pixel, as when using the second aspect, only four multipliers would be needed.
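The count of seven can be reproduced by enumerating neighbor pairs. The sketch below uses hypothetical (row, column) coordinates for a 4x4 block and assumes, as in the walk-through above, that neighbors outside the block are not used.

```python
def column_lookups(height=4, width=4):
    """Return (look-ups without sharing, look-ups with sharing) for the right-most column."""
    col = width - 1
    naive = 0
    shared = set()
    for row in range(height):
        neighbours = [(row, col - 1)]              # left neighbor
        if row > 0:
            neighbours.append((row - 1, col))      # neighbor above
        if row < height - 1:
            neighbours.append((row + 1, col))      # neighbor below
        for n in neighbours:
            naive += 1
            # An unordered pair is looked up once, whichever pixel is the center.
            shared.add(frozenset([(row, col), n]))
    return naive, len(shared)
```

Without sharing, the four pixels need 2 + 3 + 3 + 2 = 10 look-ups; sharing the three vertical pairs brings this down to 7.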
[00115] Seen another way, even in the case of [1] and [2], it is possible to save some multiplications. This is due to the fact that the difference between a center pixel and its right neighbor can be reused when the right neighbor becomes the center pixel in the next step. In detail, if I(34,40) is the intensity of the pixel at x = 34, y = 40, and I(35,40) is the intensity of the pixel immediately to the right, then ΔI_R(34,40) = (I(35,40) − I(34,40)) when the center pixel is in position (34,40). When the center pixel is instead in position (35,40),
ΔI_L(35,40) = (I(34,40) − I(35,40)) = −ΔI_R(34,40). A similar reduction can also be made with the top and bottom pixels; ΔI_T(34,41) = (I(34,40) − I(34,41)) = −ΔI_B(34,40).
However, this only saves two multiplications per pixel, resulting in two remaining
multiplications per pixel. In contrast, embodiments disclosed herein allow the same filtered pixel value to be calculated using only one multiplication per pixel.
[00116] FIG. 3 illustrates a process 300 of applying bilateral filtering to a media object comprising a plurality of samples. For each sample C of the plurality of samples in the media object (step 302), it is determined whether one or more neighbors of sample C that are above (A), below (B), to the left (L) or to the right (R) of sample C are available (step 304). As explained above, in some embodiments, the sample being filtered will always have access to all its surrounding samples at the time of filtering. However, in other embodiments, this is not always the case. As an example, assume we are filtering a block, and the block to the right has not yet been decoded. This means that a pixel situated on the right edge of the block will not have access to its right neighbor, since this neighbor belongs to the not-yet-decoded block. In this case the filtering will have to make do with fewer surrounding samples. If all samples are available (as determined at step 306), then the filtered sample for C is computed using the available samples A, B, L, and R (step 308). If not all samples are available (as determined at step 306), then the filtered sample for C is computed using a subset of A, B, L, and R for the samples that are available (step 310). For example, step 310 may use d_edge(σ_d) when filtering edge pixels and d_corner(σ_d) when filtering corner pixels, whereas step 308 may use d(σ_d).
[00117] FIG. 4 illustrates a process 400 of applying bilateral filtering to a media object comprising a plurality of samples. The method includes, for a current sample C, computing a filtered sample value I_F based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C (step 402). The filtered sample value I_F is given by the equation
Figure imgf000035_0001
where:
I_C is the current sample intensity before filtering;
ΔI_A is the difference between the current sample intensity I_C and the intensity of the sample above (I_A), such that ΔI_A = I_A − I_C;
ΔI_B is the difference between the current sample intensity I_C and the intensity of the sample below (I_B), such that ΔI_B = I_B − I_C;
ΔI_L is the difference between the current sample intensity I_C and the intensity of the sample to the left (I_L), such that ΔI_L = I_L − I_C;
ΔI_R is the difference between the current sample intensity I_C and the intensity of the sample to the right (I_R), such that ΔI_R = I_R − I_C;
σ_d is a spatial strength parameter;
σ_r is an intensity strength parameter; and
d(σ_d) is given by $d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}$.
Computing the filtered sample value I_F comprises using a lookup table with two or fewer dimensions (step 404).
[00118] In embodiments, the lookup table is two-dimensional and depends on σ_r and ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In embodiments, the lookup table is used to determine $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R.
In embodiments, the lookup table is used to compute an influence function m_{σ_r}(ΔI), where
Figure imgf000036_0002
such that the filtered sample value I_F is given by the equation
Figure imgf000036_0003
wherein computing the filtered sample value I_F involves one and only one multiplication per current sample C.
[00119] In embodiments, the lookup table is one-dimensional and depends on ΔI, where ΔI is an intensity difference ΔI_A, ΔI_B, ΔI_L, and/or ΔI_R. In embodiments, the lookup table is created based on a fixed value of σ_r (σ_r0), and when computing the filtered sample value I_F for a different value of σ_r (σ_r1), a scaling transform s(σ_r1, ΔI) is applied using a constant c determined from σ_r0 and σ_r1. In embodiments, the filtered sample value I_F is approximated using fixed point numbers. In embodiments, I_F is approximated by a fixed-point value which is given by
Figure imgf000037_0001
wherein
Figure imgf000037_0002
where » denotes arithmetic right shift and round (·) rounds to the nearest integer.
[00120] In embodiments, I_F is approximated by a fixed-point value which is given by
Figure imgf000037_0004
and
Figure imgf000037_0003
where » denotes arithmetic right shift and round (·) rounds to the nearest integer.
[00121] In embodiments, said bilateral filtering is applied during encoding and/or decoding of the media object. In embodiments, using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations, and in some embodiments, a size of a row of the lookup table is no more than 128 bits. [00122] FIG. 5 is a diagram showing functional units of node 502 (e.g. an encoder/decoder) for applying bilateral filtering to a media object comprising a plurality of samples, according to an embodiment. Node 502 includes a computing unit 504. Computing unit 504 is configured to, for a current sample C, compute a filtered sample value IF based on one or more neighboring samples above (A), below (B), to the left (Z), and to the right (R) of the current sample C,
wherein the filtered sample value IF is given by the equation
Figure imgf000038_0001
where:
$I_C$ is the current sample intensity before filtering;
$\Delta I_A$ is the difference between the current sample intensity $I_C$ and the intensity of the sample above ($I_A$), such that $\Delta I_A = I_A - I_C$;
$\Delta I_B$ is the difference between the current sample intensity $I_C$ and the intensity of the sample below ($I_B$), such that $\Delta I_B = I_B - I_C$;
$\Delta I_L$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the left ($I_L$), such that $\Delta I_L = I_L - I_C$;
$\Delta I_R$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the right ($I_R$), such that $\Delta I_R = I_R - I_C$;
$\sigma_d$ is a spatial strength parameter;
$\sigma_r$ is an intensity strength parameter; and
$d(\sigma_d)$ is given by $d(\sigma_d) = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}} + e^{-\frac{1}{2\sigma_d^2}}} = \frac{e^{-\frac{1}{2\sigma_d^2}}}{1 + 4e^{-\frac{1}{2\sigma_d^2}}}$.
Computing the filtered sample value $I_F$ comprises using a lookup table with two or fewer dimensions.
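A minimal Python sketch of a one-dimensional lookup table for this kind of filter follows. The filter equation itself appears only as a figure, so the sketch assumes the form $I_F = I_C + d(\sigma_d)\,[\mu(\Delta I_A) + \mu(\Delta I_B) + \mu(\Delta I_L) + \mu(\Delta I_R)]$ with influence function $\mu(\Delta I) = \Delta I\, e^{-\Delta I^2/(2\sigma_r^2)}$ and $d(\sigma_d) = e^{-1/(2\sigma_d^2)} / (1 + 4e^{-1/(2\sigma_d^2)})$. Both forms are assumptions chosen to match the surrounding definitions, and all function names are hypothetical.

```python
import math

def spatial_factor(sigma_d):
    """Assumed d(sigma_d) = e^(-1/(2 sigma_d^2)) / (1 + 4 e^(-1/(2 sigma_d^2)))."""
    e = math.exp(-1.0 / (2.0 * sigma_d ** 2))
    return e / (1.0 + 4.0 * e)

def build_influence_lut(sigma_r, max_diff=255):
    """One-dimensional table of dI * exp(-dI^2 / (2 sigma_r^2)) for dI = 0..max_diff."""
    return [d * math.exp(-(d * d) / (2.0 * sigma_r ** 2)) for d in range(max_diff + 1)]

def influence(lut, delta_i):
    """The assumed influence function is odd, so only non-negative dI is stored."""
    v = lut[min(abs(delta_i), len(lut) - 1)]
    return v if delta_i >= 0 else -v

def filter_sample(img, y, x, lut, d):
    """Filter one interior sample of a 2-D list of integer intensities."""
    ic = img[y][x]
    total = (influence(lut, img[y - 1][x] - ic)    # sample above (A)
             + influence(lut, img[y + 1][x] - ic)  # sample below (B)
             + influence(lut, img[y][x - 1] - ic)  # sample to the left (L)
             + influence(lut, img[y][x + 1] - ic)) # sample to the right (R)
    # Four table lookups and additions, then a single multiplication by d.
    return ic + d * total
```

Storing the product $\Delta I \cdot e^{-\Delta I^2/(2\sigma_r^2)}$ rather than the weight alone is what removes the per-neighbor multiplication in this sketch. Only the final scaling by $d(\sigma_d)$ remains.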
[00123] FIG. 6 is a block diagram of node 502 (e.g., an encoder/decoder) for applying bilateral filtering to a media object comprising a plurality of samples, according to some embodiments. As shown in FIG. 6, node 502 may comprise: processing circuitry (PC) 602, which may include one or more processors (P) 655 (e.g., a general purpose microprocessor and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); a network interface 648 comprising a transmitter (Tx) 645 and a receiver (Rx) 647 for enabling node 502 to transmit data to and receive data from other nodes connected to a network 610 (e.g., an Internet Protocol (IP) network) to which network interface 648 is connected; and a local storage unit (a.k.a. “data storage system”) 608, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 602 includes a programmable processor, a computer program product (CPP) 641 may be provided. CPP 641 includes a computer readable medium (CRM) 642 storing a computer program (CP) 643 comprising computer readable instructions (CRI) 644. CRM 642 may be a non-transitory computer readable medium, such as magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 644 of computer program 643 is configured such that when executed by PC 602, the CRI causes node 502 to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, node 502 may be configured to perform steps described herein without the need for code. That is, for example, PC 602 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.
[00124] While various embodiments of the present disclosure are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
[00125] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.

Claims

CLAIMS:
1. A method for applying bilateral filtering to a media object comprising a plurality of samples, the method comprising: for a current sample C, computing a filtered sample value $I_F$ based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C,
wherein the filtered sample value $I_F$ is given by the equation
Figure imgf000040_0001
where:
$I_C$ is the current sample intensity before filtering;
$\Delta I_A$ is the difference between the current sample intensity $I_C$ and the intensity of the sample above ($I_A$), such that $\Delta I_A = I_A - I_C$;
$\Delta I_B$ is the difference between the current sample intensity $I_C$ and the intensity of the sample below ($I_B$), such that $\Delta I_B = I_B - I_C$;
$\Delta I_L$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the left ($I_L$), such that $\Delta I_L = I_L - I_C$;
$\Delta I_R$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the right ($I_R$), such that $\Delta I_R = I_R - I_C$;
$\sigma_d$ is a spatial strength parameter;
$\sigma_r$ is an intensity strength parameter; and
$d(\sigma_d)$ is given by
Figure imgf000040_0002
wherein computing the filtered sample value $I_F$ comprises using a lookup table with two or fewer dimensions.
2. The method of claim 1, wherein the lookup table is two-dimensional and depends on $\sigma_r$ and $\Delta I$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$.
3. The method of any one of claims 1-2, wherein the lookup table is used to determine $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$.
4. The method of any one of claims 1-2, wherein the lookup table is used to compute an influence function $\mu_{\sigma_r}(\Delta I)$, where
Figure imgf000041_0001
such that the filtered sample value $I_F$ is given by the equation
Figure imgf000041_0002
wherein computing the filtered sample value $I_F$ involves one and only one multiplication per current sample C.
5. The method of any one of claims 1, 3, and 4, wherein the lookup table is one-dimensional and depends on $\Delta I$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$.
6. The method of claim 5, wherein the lookup table is created based on a fixed value of $\sigma_r$ ($\sigma_{r0}$) and, when computing the filtered sample value $I_F$ for a different value of $\sigma_r$ ($\sigma_{r1}$), a scaling transform $s(\sigma_{r1}, \Delta I)$ is applied using a constant $c$ determined from $\sigma_{r0}$ and $\sigma_{r1}$.
7. The method of any one of claims 1-6, wherein the filtered sample value $I_F$ is approximated using fixed-point numbers.
8. The method of claim 6, wherein $I_F$ is approximated by a value given by
Figure imgf000042_0002
and
Figure imgf000042_0001
where >> denotes arithmetic right shift and round(·) rounds to the nearest integer.
9. The method of any one of claims 1-8, wherein using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations.
10. The method of claim 9, wherein a size of a row of the lookup table is no more than 128 bits.
11. An encoder (502) for applying bilateral filtering to a media object comprising a plurality of samples, the encoder (502) comprising: a computing unit (504) configured to, for a current sample C, compute a filtered sample value $I_F$ based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C,
wherein the filtered sample value $I_F$ is given by the equation
Figure imgf000043_0001
where:
$I_C$ is the current sample intensity before filtering;
$\Delta I_A$ is the difference between the current sample intensity $I_C$ and the intensity of the sample above ($I_A$), such that $\Delta I_A = I_A - I_C$;
$\Delta I_B$ is the difference between the current sample intensity $I_C$ and the intensity of the sample below ($I_B$), such that $\Delta I_B = I_B - I_C$;
$\Delta I_L$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the left ($I_L$), such that $\Delta I_L = I_L - I_C$;
$\Delta I_R$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the right ($I_R$), such that $\Delta I_R = I_R - I_C$;
$\sigma_d$ is a spatial strength parameter;
$\sigma_r$ is an intensity strength parameter; and
$d(\sigma_d)$ is given by
Figure imgf000043_0002
wherein computing the filtered sample value $I_F$ comprises using a lookup table with two or fewer dimensions.
12. The encoder (502) of claim 11, wherein the lookup table is two-dimensional and depends on $\sigma_r$ and $\Delta I$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$.
13. The encoder (502) of any one of claims 11-12, wherein the lookup table is used to determine $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$.
14. The encoder (502) of any one of claims 11-12, wherein the lookup table is used to compute an influence function $\mu_{\sigma_r}(\Delta I)$, where
Figure imgf000044_0001
such that the filtered sample value $I_F$ is given by the equation
Figure imgf000044_0002
wherein computing the filtered sample value $I_F$ involves one and only one multiplication per current sample C.
15. The encoder (502) of any one of claims 11-14, wherein said computing unit is implemented in an Application-Specific Integrated Circuit (ASIC).
16. The encoder (502) of any one of claims 11-14, wherein using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations.
17. A decoder (502) for applying bilateral filtering to a media object comprising a plurality of samples, the decoder comprising: a computing unit (504) configured to, for a current sample C, compute a filtered sample value $I_F$ based on one or more neighboring samples above (A), below (B), to the left (L), and to the right (R) of the current sample C,
wherein the filtered sample value $I_F$ is given by the equation
Figure imgf000044_0003
where:
$I_C$ is the current sample intensity before filtering;
$\Delta I_A$ is the difference between the current sample intensity $I_C$ and the intensity of the sample above ($I_A$), such that $\Delta I_A = I_A - I_C$;
$\Delta I_B$ is the difference between the current sample intensity $I_C$ and the intensity of the sample below ($I_B$), such that $\Delta I_B = I_B - I_C$;
$\Delta I_L$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the left ($I_L$), such that $\Delta I_L = I_L - I_C$;
$\Delta I_R$ is the difference between the current sample intensity $I_C$ and the intensity of the sample to the right ($I_R$), such that $\Delta I_R = I_R - I_C$;
$\sigma_d$ is a spatial strength parameter;
$\sigma_r$ is an intensity strength parameter; and
$d(\sigma_d)$ is given by
Figure imgf000045_0001
wherein computing the filtered sample value $I_F$ comprises using a lookup table with two or fewer dimensions.
18. The decoder (502) of claim 17, wherein the lookup table is two-dimensional and depends on $\sigma_r$ and $\Delta I$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$.
19. The decoder (502) of any one of claims 17-18, wherein the lookup table is used to determine $e^{-\frac{\Delta I^2}{2\sigma_r^2}}$, where $\Delta I$ is an intensity difference $\Delta I_A$, $\Delta I_B$, $\Delta I_L$, and/or $\Delta I_R$.
20. The decoder (502) of any one of claims 17-18, wherein the lookup table is used to compute an influence function $\mu_{\sigma_r}(\Delta I)$, where
Figure imgf000045_0002
such that the filtered sample value $I_F$ is given by the equation
Figure imgf000046_0001
wherein computing the filtered sample value $I_F$ involves one and only one multiplication per current sample C.
21. The decoder (502) of any one of claims 17-20, wherein said computing unit is implemented in an Application-Specific Integrated Circuit (ASIC).
22. The decoder (502) of any one of claims 17-20 wherein using the lookup table comprises executing one or more single instruction, multiple data (SIMD) vector operations.
23. A computer program comprising instructions which when executed by processing circuitry (602) of a node (502) causes the node (502) to perform the method of any one of claims 1-10.
24. A carrier containing the computer program of claim 23, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201862693198P 2018-07-02 2018-07-02
US62/693,198 2018-07-02
US201862700734P 2018-07-19 2018-07-19
US62/700,734 2018-07-19

Publications (1)

Publication Number Publication Date
WO2020007748A1 true WO2020007748A1 (en) 2020-01-09

Family

ID=67139748

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/067429 WO2020007748A1 (en) 2018-07-02 2019-06-28 Bilateral filter with lut avoiding unnecessary multiplication and minimizing the lut

Country Status (1)

Country Link
WO (1) WO2020007748A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150189329A1 (en) * 2013-12-25 2015-07-02 Samsung Electronics Co., Ltd. Method, apparatus, and program for encoding image, method, apparatus, and program for decoding image, and image processing system
WO2018067051A1 (en) * 2016-10-05 2018-04-12 Telefonaktiebolaget Lm Ericsson (Publ) Deringing filter for video coding

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
C. TOMASI ET AL: "Bilateral filtering for gray and color images", SIXTH INTERNATIONAL CONFERENCE ON COMPUTER VISION (IEEE CAT. NO.98CH36271), 1 January 1998 (1998-01-01), pages 839 - 846, XP055166574, DOI: 10.1109/ICCV.1998.710815 *
QIRONG MA ET AL: "De-ringing filter for Scalable Video Coding", 2013 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO WORKSHOPS (ICMEW), IEEE, 15 July 2013 (2013-07-15), pages 1 - 4, XP032494487, DOI: 10.1109/ICMEW.2013.6618308 *
STRÖM J ET AL: "EE2-JVET related: Division-free bilateral filter", 6. JVET MEETING; 31-3-2017 - 7-4-2017; HOBART; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ); URL: HTTP://PHENIX.INT-EVRY.FR/JVET/,, no. JVET-F0096, 2 April 2017 (2017-04-02), XP030150774 *
STRÖM J ET AL: "EE2-JVET-E0032 Bilateral filter Test 1, Test2", 6. JVET MEETING; 31-3-2017 - 7-4-2017; HOBART; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ); URL: HTTP://PHENIX.INT-EVRY.FR/JVET/,, no. JVET-F0034, 23 March 2017 (2017-03-23), XP030150687 *
WENNERSTEN ET AL., BILATERAL FILTERING FOR VIDEO CODING
WENNERSTEN PER ET AL: "Bilateral filtering for video coding", 2017 IEEE VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), IEEE, 10 December 2017 (2017-12-10), pages 1 - 4, XP033325718, DOI: 10.1109/VCIP.2017.8305038 *
Y-W CHEN ET AL: "Description of SDR, HDR and 360° video coding technology proposal by Qualcomm and Technicolor - low and high complexity versions", 10. JVET MEETING; 10-4-2018 - 20-4-2018; SAN DIEGO; (THE JOINT VIDEO EXPLORATION TEAM OF ISO/IEC JTC1/SC29/WG11 AND ITU-T SG.16 ); URL: HTTP://PHENIX.INT-EVRY.FR/JVET/,, no. JVET-J0021-v5, 14 April 2018 (2018-04-14), XP030151184 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19735299

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19735299

Country of ref document: EP

Kind code of ref document: A1