WO2012066292A1 - Video compression - Google Patents

Video compression

Info

Publication number
WO2012066292A1
WO2012066292A1 (PCT/GB2011/001619)
Authority
WO
WIPO (PCT)
Prior art keywords
pixel
tile
quantization
coefficients
transform
Prior art date
Application number
PCT/GB2011/001619
Other languages
French (fr)
Inventor
William Stoye
Original Assignee
Displaylink (Uk) Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Displaylink (Uk) Limited filed Critical Displaylink (Uk) Limited
Priority to EP11801804.3A priority Critical patent/EP2641399A1/en
Publication of WO2012066292A1 publication Critical patent/WO2012066292A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/18Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a set of transform coefficients
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • H04N19/126Details of normalisation or weighting functions, e.g. normalisation matrices or variable uniform quantisers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • H04N19/14Coding unit complexity, e.g. amount of activity or edge presence estimation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/149Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/63Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using sub-band based transform, e.g. wavelets

Definitions

  • This invention relates to a method of compressing a frame of pixel tiles.
  • The compression of video data is a large and wide-ranging technical field.
  • As display devices such as televisions and computer monitors have increased in size and resolution, and the number of sources of video has increased through the expansion of television channels and Internet sites, the importance of saving bandwidth by compressing video has correspondingly increased.
  • Well-known technologies such as JPEG and MPEG provide compression technologies that are in extensive use throughout various different industries, particularly television broadcast and computing. These compression technologies operate on the principle that there are large temporal and spatial redundancies within video images that can be exploited to remove significant amounts of information without degrading the quality of the end user's experience of the resulting image.
  • a colour image may have twenty-four bits of information per pixel, being eight bits each for three colour channels of red, green and blue.
  • this information can be reduced to two bits per pixel without the quality of the final image overly suffering.
  • This can be achieved by dividing the image into rectangular blocks (or tiles), where each block is then subjected to a mathematical transform (such as the Discrete Cosine Transform) to produce a series of coefficients. These coefficients are then quantized (effectively divided by predetermined numbers) and the resulting compressed image data can be transmitted.
  • the data is decompressed by performing reverse quantization and reversing the chosen transform to reconstruct the original block.
  • entropy encoding to further reduce the amount of data that is actually transmitted.
  • Compression technologies that are based around the principle of transforming tiles and then quantizing the resulting coefficients are highly effective at reducing the amount of video data that then has to be transmitted. However, they are not necessarily as flexible as is desirable in the specific situation being used. It is known that certain types of images compress much better than others and techniques that are appropriate for photographic type images, such as conventional broadcast television, do not work as well with desktop type images produced by business computers, and vice versa. When bandwidth is restricted and different types of images need to be compressed a highly flexible approach to the compression of the image data is desirable.
  • United States Patent 5,629,780 describes a system and method for image data compression.
  • This Patent describes a method for performing colour or grayscale image compression that eliminates redundant and invisible image components.
  • the image compression uses a Discrete Cosine Transform (DCT) and each DCT coefficient yielded by the transform is quantized by an entry in a quantization matrix which determines the perceived image quality and the bit rate of the image being compressed.
  • the method adapts or customizes the quantization matrix to the image being compressed.
  • This method has a number of disadvantages, the main two being that firstly the customised quantization matrix must be generated in real-time in an iterative process, which uses up both time and processing resources, and secondly that the customised quantization matrix must be transmitted to the decompression end of the process, which uses up bandwidth. It is therefore an object of the invention to improve upon the known art.
  • a method of compressing a frame of pixel tiles comprising receiving a frame of pixel tiles, determining a bandwidth available for each pixel tile, and for each colour channel of each pixel tile performing a transform of the pixel data to create a series of coefficients, selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and performing a quantization of the series of coefficients using the selected quantization level.
  • a device for compressing a frame of pixel tiles comprising an encoder arranged to receive a frame of pixel tiles, determine a bandwidth available for each pixel tile, and for each colour channel of each pixel tile perform a transform of the pixel data to create a series of coefficients, select a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and perform a quantization of the series of coefficients using the selected quantization level.
  • the quantization level is selected from a set of predetermined quantization levels based on the distribution of the transform coefficient values in order to meet an approximate target compressed image size.
  • the quantization level for each video tile must be provided for the quantization stage of compression.
  • the quantization level corresponds approximately to an image quality level, and straightforward execution of the compression algorithm would require that the desired quality level is an input to the compression process.
  • the compressed image size may vary wildly depending on the input image.
  • By selecting the quantization level from a set of predetermined quantization levels using a function of the determined available bandwidth and the size of the coefficients generated during the transform step, there is no need to generate a new quantization matrix, which would waste time and processing resources, as in the prior art US Patent referred to above.
  • the invention also differs from the disclosure of this Patent in that the quantization levels or matrix used in the quantization step does not need to be transmitted to the decompression end of the process, thereby saving bandwidth, as the set of predetermined quantization levels can be present at the decompression end of the process.
  • the step of selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of coefficients comprises mapping the coefficient size to one of a plurality of predetermined stages and using a quantization level pre-assigned to the mapped stage.
  • their size is determined, which may be an absolute measure or may be based on the number of significant bits that are present, for example. This size is then used to select a quantization level.
  • Predetermined stages that are specific to the determined bandwidth can be used to map the coefficient size to a specific stage which has a quantization level pre-assigned to the specific stage. In this way the quantization level is chosen. Effectively a matrix of predetermined quantization levels is used, which could be 8x8, with eight different bandwidths on one axis and eight different coefficient size ranges on the other axis.
  • the quantization level for a video tile is determined after the transform stage of compression and before quantization.
  • the input to the transform for a 64x64 video tile is 4096 pixel values, each expressed as three colour components (Y, Cr, Cb), a total of 12288 values.
  • the relevant statistics can be collected in parallel with the transform stage of compression, as each transform result is produced.
  • Figure 1 is a schematic diagram showing the processing of a video frame
  • Figure 2 is a schematic diagram of components of a computer
  • Figures 3 and 4 are schematic diagrams of components of an encoder
  • Figure 5 is a schematic diagram illustrating the transform of a tile of a video frame
  • Figure 6 is a flowchart of a method of compressing a video frame.
  • Figure 1 shows a frame 10 that is comprised of tiles 12.
  • the frame is compressed through the process steps S1 colour transform, S2 tile transform, S3 quantization and S4 entropy coding.
  • This type of compression is used for example when a server is running virtual machines or client sessions for remote client devices. For example, a single server may have twenty remote clients connected, and the server must provide twenty images, each image representing the image of the client session that must be displayed at the respective client device.
  • To compress a frame of video data expressed in conventional RGB format it is desirable to perform a colour transform to the Y, Cb, Cr domain.
  • There is then performed a tile transform using, for example, a Haar transform or a 5/3 Discrete Wavelet Transform (DWT).
  • the Haar transform converts an n × n tile into low frequency data, a (n/2) × (n/2) tile, where each pixel is the average of four input pixels, and high frequency data, n² − (n/2)² additional values showing local pixel deltas from the low frequency values. This can be repeated multiple times on a large square, re-visiting the low frequency data at each iteration of the process.
  • the DWT transform performs exactly the same process but it uses low frequency data, a 5-tap-sinc-like-FIR-filtered sub-sampled (n/2) × (n/2) tile, taps (-1 2 6 2 -1), and high frequency data, n² − (n/2)² additional values showing local pixel deltas from the average of two or four adjacent pixels.
  • the DWT will typically produce far more tiny numbers than the Haar transform.
  • the Haar transform will usually provide a better result, because a sharp edge between two fixed values will create several non-zero coefficients with DWT.
  • the compression uses a tile size of 64 × 64 pixels.
  • the transform turns a tile into a series of coefficients and these are then quantized and entropy coded.
  • the compression uses a scheme called Run Length Golomb Rice. This has no tables but uses two adaptive parameters to scale the coded values depending on the distribution of input values.
  • the DWT transform does not break down into neat 2 x 2 groups like the Haar transform but is performed in strips (horizontally and vertically) on a large tile. This is explained in more detail below with reference to Figure 5.
  • FIG. 2 illustrates schematically some of the components of a server
  • the server 14 which has a central processing unit (CPU) 16, a graphics processing unit (GPU) 18 and a PCI encoder 20.
  • PCI encoder 20 is the principal component for compressing the multiple video streams received from the CPU 16.
  • the hardware encoder 20 takes un-coded video tiles as its input and produces coded data messages as its output. It is important for the encoder 20 to perform all of the steps (colour transform, tile transform, quantization and entropy coding) because otherwise the IO bandwidth is increased.
  • the encoder 20 does not need to hold an entire screen image at one time. Because of this, the encoder 20 can be built on a field-programmable gate array (FPGA) which does not need any external storage (DDR). None of the processing steps need vast storage, for example a 64 × 64 tile requires 12KB in RGB form. Holding more tiles would increase the system throughput, but it need not be huge.
  • an RLGR entropy encoder or decoder will be very small, because it has no tables.
  • Pixels to be encoded are delivered optimally over PCIe from the GPU 18 to the encoder 20, by the encoder 20 doing a PCI read or the GPU 18 doing a PCI write.
  • Each PCIe lane delivers 4Gb per second. If pixels are packed as 32 bits then this makes 125Mps (Mega pixels/second).
  • 125MHz is achievable and 250MHz is possible.
  • This means an n-lane PCIe interface will feed pixels at n pixels/clock (125MHz) or n/2 pixels/clock (250MHz) into the encoder. Desirable values of n are 4, 8 and 16. There is value in keeping up with the bulk arrival rate of pixels in the encoder 20, otherwise there has to be added a storage buffer to the start of the encoder 20, which adds little value and increases encode latency.
  • the encode stages are as follows. Firstly, there is a colour transform, which can be done in parallel on as many pixels as required. Inputs are 8 bits, output to be determined but perhaps 10 to 11 bits. It is possible that the Y colour channel will require more bits than Cr or Cb. If necessary, it is possible to add clip logic to control this; the whole protocol is slightly lossy, so this is acceptable if done carefully. From here on the three colour channels are split apart and dealt with in parallel.
  • the next stage is the tile transform.
  • the entire transform consists of what amounts to straightforward 3-tap and 5-tap FIR filters so it is possible to achieve whatever level of parallel operation is required.
  • the coefficients are fixed and are simple integer values, no multiplies required.
  • the encoder 20 includes a stage 1 vertical filter, which must keep up with the arriving pixel rate; a 5 × 64 pixel delay line is required. Assuming 30 bits/pixel this makes 9600 bits regardless of calculations/cycle. This is done as flip-flops because of the bandwidth required. Similarly, a stage 1 horizontal filter is needed, which must keep up with the arriving pixel rate, covering five adjacent pixels. At this point ¾ of the pixels go straight to the entropy coder; ¼ (32x32) require stage 2 wavelet transformation. These naturally arrive at half the overall pixel arrival rate, so it is possible to halve the width of any processing.
  • the encoder 20 also includes a stage 2 vertical filter, with a 5x32 pixel delay line required, using 4800 flip-flops, and a stage 2 horizontal filter covering 5 adjacent pixels. At this point another ¾ of the pixels go to the entropy coder; ¼ (16x16) require stage 3 wavelet transformation. These arrive at a quarter of the overall pixel arrival rate.
  • the stage 3 vertical filter, with a 5x16 pixel delay line required, has another 2400 flip-flops.
  • Quantization is done as a simple shift, followed by rounding. This should be done as soon as possible after the final filtering stage as it allows data path widths to be reduced.
  • the values must be fed to the entropy coder in the correct order.
  • Entropy coding is performed separately on each of the Y, Cr, Cb channels.
  • the entropy coder is adaptive. The number of bits used to encode a value depends on two parameters kP and kRP, and these adapt as each value is encoded, depending on the size of each passing value and the current values of the parameters. This means that it is not possible to do entropy coding in parallel on values in a stream, although notably each channel of the tile is a separate stream.
  • the speed of the encoder 20 is very important. It is simple to add, subtract and compare values, but if only one per cycle can be handled then the encoder 20 cannot output more than one pixel per cycle for a tile. Ideally the encoder 20 will perform 2 or 4 per cycle. On an FPGA, 4 per cycle at 125MHz is easier than 2 per cycle at 250MHz.
  • the entropy coding can be performed at the input pixel rate. For at least part of the tile processing, the values can arrive this fast. If they are processed slower than this then there is a need for more store internally; the encoder 20 can make the CPU 16 wait longer for each tile; and there may need to be more replications of the entropy coding circuit in order to keep the whole system busy, i.e. to allow the host to feed data into the encoder 20 at the full interface rate.
  • the result of entropy coding should be built up in three separate memories, as their final size is unknown. There will need to be multiple copies of these in order to keep the whole system busy.
  • a separate command from the CPU 16 will cause the encoder 20 to write the output packet to a chosen address over PCIe.
  • the output will be much smaller than the input; compression by ×5 to ×10 ought to be achievable.
  • the CPU 16 gives one command to encode a tile. When the encoding is complete it is told the size of the result. It then gives a separate command to write the output over PCI, as where to write it may be affected by its size, for example packing into transport frames may be needed. There will need to be several copies of the output buffer in order to allow the encoder 20 to stay busy while this happens.
  • the CPU 16 has the option to re-compress with stronger quantization.
  • An improvement to the encoder 20 is to perform the transform, sum the unquantized coefficients, and choose the quantization level in an intelligent manner. The cost of this is that the entire tile (about 16KB) must be stored after the tile transform; and latency seen by the CPU 16 is increased.
  • the encoder 20 can use an input interleave. On input it is possible to load several (say four) horizontally adjacent tiles 12 at a time. The advantage of this is that PCI transfers are 1024 bytes (256 pixels) rather than 256 bytes (for one tile), so that use of the PCI bus is more efficient.
  • a store performs "de- interleave" which is analogous to the interleaving of RS codewords in a modem. To de-interleave four tiles only needs a store that can contain exactly four tiles (for example 4* 12KB) with a total bandwidth 2*PCl arrival rate.
  • the store can be organised with its own internal interleaving to get the required bandwidth with single-port RAMs. This is easiest if input and output are synchronous. This store can be organised so that any number up to the maximum can be loaded. This is carried out before the colour transform because the colour transform causes some bit growth.
  • Figures 3 and 4 show more detail of the encoder 20.
  • a PCI interface 22 connects to the CPU 16 over a PCI bus.
  • the interface logic 22 connects to a set of control registers 24. Downstream of the interface 22 is a colour transform unit 26 and the tile transform logic 28.
  • the tile transform unit 28 outputs to a tile store 30.
  • the tile store 30 connects to a quantization stage 32 and downstream of the quantization stage 32 is an entropy encoder 34, which connects to an output store 36.
  • the diagram is scaled by the factor n, the number of PCIe lanes. Desirable values for n are 4, 8 and 16.
  • This part of the device processes the input data at "line rate", i.e. however fast it arrives over PCIe.
  • a tile is completed and stored in the tile store 30 in little more than the PCIe transfer time.
  • In relation to the entropy coder 34, shown in Figure 4, the second part of the encoding process goes more slowly because in large configurations the entropy coding cannot keep up with the rate of data arrival.
  • This part is scaled by the factor n2, the number of steps/cycle of RLGR entropy coding that can be achieved.
  • Likely values for n2 are 1, 2, 4 or 8, depending on clock speed and technology.
  • the number of "tiles in flight" needed to keep the engine fully busy is somewhere between n/n2 and 2xn/n2, depending on how often the CPU 16 checks whether a tile has been completed.
  • the entropy coder output must be stored again in an output store 36, as shown in Figure 4, because at this point its size is not known, so for all but the first channel the encoder 20 does not know where to put it.
  • the 12KB value suggested above means "tile not compressed at all" and is assumed to be a worst case. Some upper bound must be chosen, above which the compression has failed (i.e. the tile must be recompressed with greater quantization).
  • the tile store 30 and the output store 36 could be the same memory system or could be separate, depending on bandwidth requirements. The best structure (and interleave organisation etc.) will vary depending on n and n2.
  • the three two-dimensional DWT filters have identical logic within certain parameters.
  • the bits/value might not be the same for each channel and/or level (e.g. Y might need more than Cr, Cb).
  • Figure 5 illustrates the mechanics of the tile transform process carried out by the tile transform logic 28.
  • the individual colour components of the tile 12 are each processed three times with a DWT. Each pass of the DWT is carried out firstly vertically and then horizontally through the tile.
  • a preferred embodiment is for the CPU 16 to provide a scatter list of output (location, size) entries. Each entry is tagged with whether a tile can be split over the end of the entry. This allows a set of network buffers to be described, where each buffer might be split up due to logical/physical translation.
  • the encoder 20 has to indicate somehow which entries have been used. Within a buffer, enough space has to be allowed for a TS_RFX_TILE block header. This is 19 bytes including (x, y) coordinates, length fields and quantization table fields, followed by the bit-packed data.
  • High performance PCIe throughput is essential for a high performance encoder product.
  • Each PCIe lane can support uncompressed data for two full high-definition 30Hz updating screens, as a theoretical maximum. So, a ×16 PCIe GPU and a ×16 PCIe encoder, with everything else perfect, cannot achieve more than 32 such screens.
  • the current structure of GPU interfacing in Windows 7 does not allow movement of Windows 7 screen data direct between the GPU 18 and the encoder 20. At the very least the data would move twice over the PCIe bus: once moved by a DirectXIO primitive which copies a texture from GPU 18 to system memory, and then again as the encoder 20 issues PCIe reads to that texture.
  • the known path is for the encoder 20 to perform PCIe READs from texture memory in store. It is desirable that the encoder 20 would accept PCIe WRITE operations containing the pixel data.
  • the encoder 20 provides content-sensitive quantization.
  • the value of doing quantization after the tile store is that it allows the hardware to determine a quantization level, based on a statistical measure of the coefficient values that enter the tile store, from a set of predetermined quantization levels. If there are many large values (which implies that this tile is hard to compress) then it is desirable to increase the quantization level, so that the result is usefully compact.
  • the high level decision is to compress to available bandwidth. Based on a gross measure of current activity, each tile is given a compressed size target at the start of the compression operation.
  • the coefficients exist in ten sub-bands, ranging from low frequency (most important) to high frequency (can be quantized most) data.
  • the encode process can quantize each sub-band separately, so that each tile has ten quantization values sent with it.
  • the encoder 20 collects statistics about them.
  • the most desirable statistic to collect is the number of significant bits for a range of quality settings, all in parallel.
  • the encoder 20 picks the quality setting where the number of significant bits does not exceed the desired encoded tile size. The entropy coder will not result in precisely this many bits but on average it will be close enough to be useful.
  • a simple area of the screen will not require much space even when coded at maximum possible quality. So, there is little waste, and complexity in one area of the screen will not compromise quality in another. There are other strategies which the CPU 16 can use to balance quality in different areas, or to mend tiles which were sent at low quality and where there is now spare bandwidth available.
  • the GPU 18 subsamples the entire screen in store.
  • the subsampled screen is then passed to the encoder 20 in 32x32 tiles, reducing the initial transfer time by ¾.
  • the encoder 20 operates to encode these as if the high-frequency coefficients are all zeros. Or, the transfer happens 64x64 but the encoder 20 then appears to produce 4 encoded tiles (in a 2x2 square). This reduces PCIe load a great deal for cases where the encoder 20 is going to quantize the high frequency coefficients out of existence anyway.
  • the fundamental part of the compression process is shown in Figure 6.
  • the method of compressing the frame of pixel tiles comprises, firstly step S6.1, which comprises receiving the frame of pixel tiles, secondly step S6.2 which comprises determining a bandwidth available for each pixel tile, and then for each colour channel of each pixel tile steps S6.3, S6.4 and S6.5 are repeated.
  • This method is executed by the encoder 20, receiving the frame of pixels from either the CPU 16 or the GPU 18, with the information about the bandwidth being supplied by the CPU 16.
  • the available bandwidth for each pixel tile is expressed as a bit rate per pixel.
  • Step S6.3 comprises performing a transform of the pixel data to create a series of coefficients
  • step S6.4 comprises selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients
  • step S6.5 comprises performing a quantization of the series of coefficients using the selected quantization level.
  • the tiles of the frame are compressed using a quantization level that is selected intelligently, in real-time, as the compression is carried out.
  • a desired size for the post-quantization data is known and the quantization level is selected to achieve that desired size, taking into account the coefficients that have resulted from the transform step S6.3.
  • Step S6.4 uses estimates of the final entropy-coded size mapped to a range of different quantization settings.
  • the estimates need not be exact in order to be useful, as meeting an approximate target for the encoded size of the tile is sufficient to meet design objectives concerning bandwidth management.
  • the tile is quantized using a range of quantization values, with different values for each channel and each sub-band. A number of estimation methods are possible, ranging in complexity.
  • the quantization settings for the tile are approximated using a single "quality" metric which is expressed as an expected bit/pixel value. These are determined by exhaustive search over a chosen set of reference images. A small finite set of quality metric settings is chosen, giving “minimum compression”, “maximum compression”, and a range of values between. Eight values would be sufficient. Each quality setting provides a quantization level for each sub-band.
  • a statistic is gathered over the coefficients. In a hardware implementation these can be done in parallel.
  • the statistic can be improved by giving special consideration to zero coefficients, where the entropy coder in use can code runs of zeros in less than one bit. When a run of zeros appears within a sub-band, after a fixed number of zeros their size is taken to be 0 rather than 1. Making this change after six consecutive zeros gives reasonable results, but the optimum will depend on the precise entropy coding system in use. A sketch combining this statistic with the quality-setting selection appears after this list.
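The following Python sketch (placed here, after the list, as referenced above) combines the significant-bits statistic with the quality-setting selection. The eight quality settings, their per-sub-band shifts and the example coefficients are invented placeholders; only the shape of the scheme comes from the items above.

```python
# Minimal sketch: for each of eight hypothetical quality settings, sum the
# significant bits of the quantized coefficients (zeros become free after
# six in a row) and pick the weakest quantization that fits the target.
NSUB = 10                                    # sub-bands, low to high frequency

# Hypothetical shift table: higher quality index and higher sub-bands
# quantize more strongly. Real tables would come from the exhaustive
# search over reference images mentioned above.
QUALITY_SHIFTS = [[min(q + s // 3, 12) for s in range(NSUB)] for q in range(8)]

def estimated_bits(subbands, shifts, zero_run_limit=6):
    """Sum of significant bits after quantization; zeros cost one bit each
    until zero_run_limit consecutive zeros, after which they cost nothing."""
    total = 0
    for band, shift in zip(subbands, shifts):
        run = 0
        for c in band:
            q = abs(c) >> shift
            if q == 0:
                run += 1
                total += 0 if run > zero_run_limit else 1
            else:
                run = 0
                total += q.bit_length()
    return total

def pick_quality(subbands, target_bits):
    """Highest quality (weakest quantization) whose estimate fits the target;
    in hardware all eight estimates could be gathered in parallel."""
    for q, shifts in enumerate(QUALITY_SHIFTS):
        if estimated_bits(subbands, shifts) <= target_bits:
            return q, shifts
    return len(QUALITY_SHIFTS) - 1, QUALITY_SHIFTS[-1]

bands = [[40, 0, 0, 0, -3, 12, 0, 0] for _ in range(NSUB)]
print(pick_quality(bands, target_bits=150))
```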

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method of compressing a frame of pixel tiles comprises receiving a frame of pixel tiles, determining a bandwidth available for each pixel tile, and for each colour channel of each pixel tile performing a transform of the pixel data to create a series of coefficients, selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and performing a quantization of the series of coefficients using the selected quantization level. In a preferred embodiment, the selection of the quantization level comprises mapping the coefficient size to one of a plurality of predetermined stages and using a quantization level pre-assigned to the mapped stage.

Description

DESCRIPTION
VIDEO COMPRESSION
This invention relates to a method of compressing a frame of pixel tiles.
The compression of video data is a large and wide-ranging technical field. In general, as display devices such as televisions and computer monitors have increased in size and resolution and the number of sources of video has increased through the expansion of television channels and Internet sites, the importance of saving bandwidth by compressing video has correspondingly increased. Well-known technologies such as JPEG and MPEG provide compression technologies that are in extensive use throughout various industries, particularly television broadcast and computing. These compression technologies operate on the principle that there are large temporal and spatial redundancies within video images that can be exploited to remove significant amounts of information without degrading the quality of the end user's experience of the resulting image.
For example, a colour image may have twenty-four bits of information per pixel, being eight bits each for three colour channels of red, green and blue. Using conventional compression techniques, this information can be reduced to two bits per pixel without the quality of the final image overly suffering. This can be achieved by dividing the image into rectangular blocks (or tiles), where each block is then subjected to a mathematical transform (such as the Discrete Cosine Transform) to produce a series of coefficients. These coefficients are then quantized (effectively divided by predetermined numbers) and the resulting compressed image data can be transmitted. At the receiving end, the data is decompressed by performing reverse quantization and reversing the chosen transform to reconstruct the original block. Other steps may also occur in the process, such as entropy encoding, to further reduce the amount of data that is actually transmitted. Compression technologies that are based around the principle of transforming tiles and then quantizing the resulting coefficients are highly effective at reducing the amount of video data that then has to be transmitted. However, they are not necessarily as flexible as is desirable in the specific situation being used. It is known that certain types of images compress much better than others and techniques that are appropriate for photographic type images, such as conventional broadcast television, do not work as well with desktop type images produced by business computers, and vice versa. When bandwidth is restricted and different types of images need to be compressed a highly flexible approach to the compression of the image data is desirable.
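To make that round trip concrete, the following Python sketch transforms one block, quantizes the coefficients, and reconstructs it. The 8 × 8 block size, the single quantization step and the DCT-II basis are illustrative choices, not details taken from this document.

```python
import numpy as np

N = 8  # illustrative block edge; the tiles discussed below are 64x64

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)        # DC row scaling for orthonormality
    return c

C = dct_matrix(N)

def compress_block(block: np.ndarray, q_step: float) -> np.ndarray:
    coeffs = C @ block @ C.T              # 2-D transform
    return np.round(coeffs / q_step)      # quantization: divide and round

def decompress_block(q: np.ndarray, q_step: float) -> np.ndarray:
    return C.T @ (q * q_step) @ C         # reverse quantization + transform

block = np.random.randint(0, 256, (N, N)).astype(float)
restored = decompress_block(compress_block(block, 16.0), 16.0)
print(float(np.abs(block - restored).max()))  # error bounded by the step
```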
United States Patent 5,629,780 describes a system and method for image data compression. This Patent describes a method for performing colour or grayscale image compression that eliminates redundant and invisible image components. The image compression uses a Discrete Cosine Transform (DCT) and each DCT coefficient yielded by the transform is quantized by an entry in a quantization matrix which determines the perceived image quality and the bit rate of the image being compressed. The method adapts or customizes the quantization matrix to the image being compressed. This method has a number of disadvantages, the main two being that firstly the customised quantization matrix must be generated in real-time in an iterative process, which uses up both time and processing resources, and secondly that the customised quantization matrix must be transmitted to the decompression end of the process, which uses up bandwidth. It is therefore an object of the invention to improve upon the known art.
According to a first aspect of the present invention, there is provided a method of compressing a frame of pixel tiles comprising receiving a frame of pixel tiles, determining a bandwidth available for each pixel tile, and for each colour channel of each pixel tile performing a transform of the pixel data to create a series of coefficients, selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and performing a quantization of the series of coefficients using the selected quantization level.
According to a second aspect of the present invention, there is provided a device for compressing a frame of pixel tiles comprising an encoder arranged to receive a frame of pixel tiles, determine a bandwidth available for each pixel tile, and for each colour channel of each pixel tile perform a transform of the pixel data to create a series of coefficients, select a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and perform a quantization of the series of coefficients using the selected quantization level.
Owing to the invention, it is possible to provide a method for compressing a video image into a compressed tile format whereby the quantization level is selected from a set of predetermined quantization levels based on the distribution of the transform coefficient values in order to meet an approximate target compressed image size. The quantization level for each video tile must be provided for the quantization stage of compression. The quantization level corresponds approximately to an image quality level, and straightforward execution of the compression algorithm would require that the desired quality level is an input to the compression process. However, at a given quality level the compressed image size may vary wildly depending on the input image.
By selecting the quantization level from a set of predetermined quantization levels using a function of the determined available bandwidth and the size of the coefficients generated during the transform step, there is no need to perform any generation of a new quantization matrix, which would waste time and processing resources, as in the prior art US Patent referred to above. The invention also differs from the disclosure of this Patent in that the quantization levels or matrix used in the quantization step does not need to be transmitted to the decompression end of the process, thereby saving bandwidth, as the set of predetermined quantization levels can be present at the decompression end of the process. Preferably, the step of selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of coefficients comprises mapping the coefficient size to one of a plurality of predetermined stages and using a quantization level pre-assigned to the mapped stage. Once the coefficients have been obtained from the transform, their size is determined, which may be an absolute measure or may be based on the number of significant bits that are present, for example. This size is then used to select a quantization level. Predetermined stages (eight for example) that are specific to the determined bandwidth can be used to map the coefficient size to a specific stage which has a quantization level pre-assigned to the specific stage. In this way the quantization level is chosen. Effectively a matrix of predetermined quantization levels is used, which could be 8x8, with eight different bandwidths on one axis and eight different coefficient size ranges on the other axis.
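A minimal sketch of such a lookup follows. The table contents, the stage boundaries and the significant-bits mapping are invented placeholders; only the shape of the scheme (an 8x8 matrix indexed by bandwidth stage and coefficient-size stage) comes from the text.

```python
import numpy as np

# quant_levels[bandwidth_stage][size_stage] -> quantization level.
# Placeholder contents: the level grows as bandwidth tightens or sizes grow.
quant_levels = np.clip(np.add.outer(np.arange(8), np.arange(8)), 0, 10)

def size_stage(coeffs) -> int:
    """Hypothetical mapping of coefficient magnitude to a stage (0..7)."""
    peak = max(abs(int(c)) for c in coeffs)
    return min(peak.bit_length() // 2, 7)

def select_level(bandwidth_stage: int, coeffs) -> int:
    return int(quant_levels[bandwidth_stage][size_stage(coeffs)])

print(select_level(3, [190, -12, 7, 0, 2]))  # -> one pre-assigned level
```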
In a server-client system with a server providing compressed images to many remote clients, the aggregate of compressed image size is likely to be the correct determining factor in deciding the quality level to use, due to finite network bandwidth. This leads to a chicken-and-egg problem in deciding the quality/quantization level to use. This becomes a problem when users display images which compress less well than expected. Typically this is caused by very noisy images, such as a screen full of tiny writing or rapidly changing or very busy video imagery. For example an image may start out as twenty-four bits/pixel (eight each of red, green, blue). If quantization settings are used which compress a smooth image to one bit/pixel, then for a noisy image the same settings only get to two bits/pixel. As a general rule, at more extreme quantization (= lower quality) the difference is greater than at high quality. It is advantageous to compress with a requested bit/pixel level, rather than a requested quality.
One solution to this is to rely on statistical multiplexing, in order to hope that if some users require the compression of problem images, others may be less demanding. This will work on many occasions but can sometimes break down. For instance, if many users are watching the same multicast video, or in a teaching context are all requested to perform the same actions on their terminals, then this statistical assumption breaks down. The best solution to this problem of quality against bandwidth is to determine a budget of network bandwidth (i.e. compressed image size) for each compressed tile, dependent on the total amount of compression activity required in any given phase of activity in the server.
The quantization level for a video tile is determined after the transform stage of compression and before quantization. The input to the transform for a 64x64 video tile is 4096 pixel values, each expressed as three colour components (Y, Cr, Cb), a total of 12288 values. The output of the transform is 4096 coefficient values. Coefficient values can be quantized (= reduced in size) with little effect on human perception of the image; this is why the transform stage is performed. The relevant statistics can be collected in parallel with the transform stage of compression, as each transform result is produced.
Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:-
Figure 1 is a schematic diagram showing the processing of a video frame,
Figure 2 is a schematic diagram of components of a computer,
Figures 3 and 4 are schematic diagrams of components of an encoder,
Figure 5 is a schematic diagram illustrating the transform of a tile of a video frame, and
Figure 6 is a flowchart of a method of compressing a video frame.
Figure 1 shows a frame 10 that is comprised of tiles 12. The frame is compressed through the process steps S1 colour transform, S2 tile transform, S3 quantization and S4 entropy coding. This type of compression is used for example when a server is running virtual machines or client sessions for remote client devices. For example, a single server may have twenty remote clients connected, and the server must provide twenty images, each image representing the image of the client session that must be displayed at the respective client device. Owing to the limits on current connection technologies this type of server-client system will only work if the outgoing video data is highly compressed from the original size.
To compress a frame of video data that is expressed in conventional RGB format it is desirable to perform a colour transform to the Y, Cb, Cr domain. There is then performed a tile transform using, for example, a Haar transform or a 5/3 Discrete Wavelet Transform (DWT). The Haar transform converts an n × n tile into low frequency data, a (n/2) × (n/2) tile, where each pixel is the average of four input pixels, and high frequency data, n² − (n/2)² additional values showing local pixel deltas from the low frequency values. This can be repeated multiple times on a large square, re-visiting the low frequency data at each iteration of the process.
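A one-level sketch of this Haar step, assuming the averaging form described above (integer rounding details omitted):

```python
import numpy as np

def haar_level(tile: np.ndarray):
    """One Haar level: the LL quarter is the average of each 2x2 group;
    three quarters of delta values carry the rest (the fourth pixel of
    each group is recoverable as 4*ll - a - b - c)."""
    a, b = tile[0::2, 0::2], tile[0::2, 1::2]
    c, d = tile[1::2, 0::2], tile[1::2, 1::2]
    ll = (a + b + c + d) / 4.0            # low frequency data
    return ll, (a - ll, b - ll, c - ll)   # high frequency deltas

tile = np.random.randint(0, 256, (64, 64)).astype(float)
ll, deltas = haar_level(tile)
ll2, _ = haar_level(ll)  # re-visit the low-frequency quarter, as in the text
```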
The DWT transform performs exactly the same process but it uses low frequency data, a 5-tap-sinc-like-FIR-filtered sub-sampled (n/2) × (n/2) tile, taps (-1 2 6 2 -1), and high frequency data, n² − (n/2)² additional values showing local pixel deltas from the average of two or four adjacent pixels. Generally, for photographic data, the DWT will typically produce far more tiny numbers than the Haar transform. For synthetic images, the Haar transform will usually provide a better result, because a sharp edge between two fixed values will create several non-zero coefficients with DWT.
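The taps (-1 2 6 2 -1) are, up to the /8 normalisation, the low-pass analysis filter of the LeGall 5/3 wavelet, which is usually implemented as two integer lifting steps. A one-dimensional sketch follows, with symmetric extension at the tile edges as one plausible boundary choice:

```python
def dwt53_1d(x):
    """One 5/3 analysis pass: 'predict' makes the high-pass (odd) samples,
    'update' makes the low-pass (even) samples. Even-length input assumed,
    with symmetric extension at both ends. These lifting steps are
    algebraically equivalent to low-pass taps (-1 2 6 2 -1)/8."""
    n = len(x)
    ext = [x[1]] + list(x) + [x[-2]]                  # mirror the edges
    # predict: odd sample minus the mean of its even neighbours
    d = [ext[i + 1] - (ext[i] + ext[i + 2]) // 2 for i in range(1, n, 2)]
    # update: even sample plus a quarter of the neighbouring high-pass values
    s = []
    for j in range(0, n, 2):
        dl = d[j // 2 - 1] if j > 0 else d[0]
        dr = d[j // 2] if j // 2 < len(d) else d[-1]
        s.append(ext[j + 1] + (dl + dr + 2) // 4)
    return s, d  # low-frequency and high-frequency halves

lows, highs = dwt53_1d(list(range(16)))  # apply along rows, then columns
```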
The compression uses a tile size of 64 × 64 pixels. The transform turns a tile into a series of coefficients and these are then quantized and entropy coded. The compression uses a scheme called Run Length Golomb Rice (RLGR). This has no tables but uses two adaptive parameters to scale the coded values depending on the distribution of input values. In relation to the tile transform, the DWT transform does not break down into neat 2 × 2 groups like the Haar transform but is performed in strips (horizontally and vertically) on a large tile. This is explained in more detail below with reference to Figure 5.
Figure 2 illustrates schematically some of the components of a server 14, which has a central processing unit (CPU) 16, a graphics processing unit (GPU) 18 and a PCI encoder 20. Obviously other components of the server 14 would also be present, such as several different memory devices and other interfaces, but these have been omitted for clarity purposes. The GPU 18 is for controlling the output of a local display device connected to the computer and can be used in the compression of images if required. The PCI encoder 20 is the principal component for compressing the multiple video streams received from the CPU 16.
The hardware encoder 20 takes un-coded video tiles as its input and produces coded data messages as its output. It is important for the encoder 20 to perform all of the steps (colour transform, tile transform, quantization and entropy coding) because otherwise the IO bandwidth is increased. The encoder 20 does not need to hold an entire screen image at one time. Because of this, the encoder 20 can be built on a field-programmable gate array (FPGA) which does not need any external storage (DDR). None of the processing steps need vast storage, for example a 64 × 64 tile requires 12KB in RGB form. Holding more tiles would increase the system throughput, but it need not be huge. In hardware, an RLGR entropy encoder or decoder will be very small, because it has no tables.
Pixels to be encoded are delivered optimally over PCIe from the GPU 18 to the encoder 20, by the encoder 20 doing a PCI read or the GPU 18 doing a PCI write. Each PCIe lane delivers 4Gb per second. If pixels are packed as 32 bits then this makes 125Mps (Mega pixels/second). For an FPGA-based system, 125MHz is achievable and 250MHz is possible. This means an n-lane PCIe interface will feed pixels at n pixels/clock (125MHz) or n/2 pixels/clock (250MHz) into the encoder. Desirable values of n are 4, 8 and 16. There is value in keeping up with the bulk arrival rate of pixels in the encoder 20, otherwise there has to be added a storage buffer to the start of the encoder 20, which adds little value and increases encode latency.
A full high-definition video at 30Hz is 60Mpixels/second, so an n-lane PCIe interface can drive 2×n screens at this update rate. In practical commercial systems it would be reasonable to support more screens than this, but this is a higher level product specification choice. At this level, it is more relevant to think of the number of worst-case sessions that can be supported. Also, there may be other bottlenecks in the system, such as the GPU, the CPU or any LAN interface that is present.
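These figures can be checked with simple arithmetic (assuming the 4Gb/s effective lane payload and 32-bit packed pixels quoted above):

```python
# Back-of-envelope checks of the rates quoted in the text.
lane_bits_per_s = 4e9                    # effective PCIe payload per lane
pixels_per_lane = lane_bits_per_s / 32   # 32-bit packed pixels -> 125e6 px/s

screen_px_per_s = 1920 * 1080 * 30       # one full-HD screen at 30Hz
for n in (4, 8, 16):                     # the desirable lane counts
    print(f"{n} lanes -> {n * pixels_per_lane / screen_px_per_s:.1f} screens")
# 16 lanes -> ~32 screens, matching the 2*n figure above

print(64 * 64 * 3)   # 12288 bytes: one 64x64 RGB tile, the ~12KB quoted
```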
The encode stages are as follows. Firstly, there is a colour transform, which can be done in parallel on as many pixels as required. Inputs are 8 bits, output to be determined but perhaps 10 to 11 bits. It is possible that the Y colour channel will require more bits than Cr or Cb. If necessary, it is possible to add clip logic to control this; the whole protocol is slightly lossy, so this is acceptable if done carefully. From here on the three colour channels are split apart and dealt with in parallel.
The next stage is the tile transform. The entire transform consists of what amounts to straightforward 3-tap and 5-tap FIR filters so it is possible to achieve whatever level of parallel operation is required. The coefficients are fixed and are simple integer values, no multiplies required.
The encoder 20 includes a stage 1 vertical filter, which must keep up with the arriving pixel rate; a 5 × 64 pixel delay line is required. Assuming 30 bits/pixel this makes 9600 bits regardless of calculations/cycle. This is done as flip-flops because of the bandwidth required. Similarly, a stage 1 horizontal filter is needed, which must keep up with the arriving pixel rate, covering five adjacent pixels. At this point ¾ of the pixels go straight to the entropy coder; ¼ (32x32) require stage 2 wavelet transformation. These naturally arrive at half the overall pixel arrival rate, so it is possible to halve the width of any processing.
The encoder 20 also includes a stage 2 vertical filter, with a 5x32 pixel delay line required, using 4800 flip-flops, and a stage 2 horizontal filter covering 5 adjacent pixels. At this point another ¾ of the pixels go to the entropy coder; ¼ (16x16) require stage 3 wavelet transformation. These arrive at a quarter of the overall pixel arrival rate. The stage 3 vertical filter, with a 5x16 pixel delay line required, has another 2400 flip-flops. There is also a stage 3 horizontal filter covering 5 adjacent pixels. Of these, another ¼ (8x8) go through a final delta-coding stage. These are effectively the DC values of 8 × 8 sub-portions of the whole tile.
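The delay-line sizes quoted for the three stages follow directly from a 5-tap vertical filter over tiles of halving width:

```python
BITS_PER_PIXEL = 30  # three colour channels of roughly 10 bits each
for stage, width in ((1, 64), (2, 32), (3, 16)):
    # a 5-tap vertical filter needs 5 full lines of the current tile width
    print(f"stage {stage}: {5 * width * BITS_PER_PIXEL} flip-flop bits")
# prints 9600, 4800 and 2400, matching the figures in the text
```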
Quantization is done as a simple shift, followed by rounding. This should be done as soon as possible after the final filtering stage as it allows data path widths to be reduced. The values must be fed to the entropy coder in the correct order. About ¾ of the tile can feed directly from the tile filter into the entropy coder, but ¼ of the tile (the 1024 pixels that require stage 2/stage 3 processing) have to be stored while the rest are entropy coded. This store is SRAM, though it is not large (4KB). Its bandwidth must be at least half the pixel arrival rate.
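A sketch of shift-and-round quantization, assuming round-to-nearest and a shift of at least one:

```python
# "A simple shift, followed by rounding": adding half of the quantization
# step before shifting rounds to nearest instead of always rounding down.
# Assumes shift >= 1; negative values are rounded symmetrically.
def quantize(value: int, shift: int) -> int:
    half = 1 << (shift - 1)
    if value >= 0:
        return (value + half) >> shift
    return -((-value + half) >> shift)

def dequantize(q: int, shift: int) -> int:
    return q << shift  # reverse quantization

assert quantize(37, 3) == 5 and dequantize(quantize(37, 3), 3) == 40
assert quantize(-37, 3) == -5
```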
Entropy coding is performed separately on each of the Y, Cr, Cb channels. The entropy coder is adaptive. The number of bits used to encode a value depends on two parameters kP and kRP, and these adapt as each value is encoded, depending on the size of each passing value and the current values of the parameters. This means that it is not possible to do entropy coding in parallel on values in a stream, although fortunately each channel of the tile is a separate stream.
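The following sketch shows the flavour of such an adaptive coder. It is plain adaptive Golomb-Rice, not the full RLGR algorithm: the real coder keeps its adaptation state in the fractional parameters kP and kRP, has a run mode for zeros, and escapes long unary prefixes, all of which are simplified away here.

```python
def zigzag(v: int) -> int:
    """Interleave signs: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ..."""
    return 2 * v if v >= 0 else -2 * v - 1

def rice_bits(u: int, k: int) -> str:
    """Golomb-Rice codeword: unary quotient, then k remainder bits."""
    q, r = u >> k, u & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, "b").zfill(k) if k else "")

def adaptive_rice_encode(values, k=1):
    out = []
    for v in values:
        u = zigzag(v)
        out.append(rice_bits(u, k))
        if u >> k:               # value too big for current k: adapt upward
            k += 1
        elif u == 0 and k > 0:   # a zero: adapt downward
            k -= 1
    return "".join(out)

print(adaptive_rice_encode([0, 0, 3, -1, 4, 0]))
```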
The speed of the encoder 20 is very important. It is simple to add, subtract and compare values, but if only one per cycle can be handled then the encoder 20 cannot output more than one pixel per cycle for a tile. Ideally the encoder 20 will perform 2 or 4 per cycle. On an FPGA, 4 per cycle at 125MHz is easier than 2 per cycle at 250MHz.
Ideally, the entropy coding can be performed at the input pixel rate. For at least part of the tile processing, the values can arrive this fast. If they are processed slower than this then there is a need for more store internally; the encoder 20 can make the CPU 16 wait longer for each tile; and there may need to be more replications of the entropy coding circuit in order to keep the whole system busy, i.e. to allow the host to feed data into the encoder 20 at the full interface rate. The result of entropy coding should be built up in three separate memories, as their final size is unknown. There will need to be multiple copies of these in order to keep the whole system busy.
Once the process is complete a separate command from the CPU 16 will cause the encoder 20 to write the output packet to a chosen address over PCIe. In typical use, the output will be much smaller than the input; compression by ×5 to ×10 ought to be achievable. The CPU 16 gives one command to encode a tile. When the encoding is complete it is told the size of the result. It then gives a separate command to write the output over PCI, as where to write it may be affected by its size, for example packing into transport frames may be needed. There will need to be several copies of the output buffer in order to allow the encoder 20 to stay busy while this happens.
An alternative approach is to process multiple tiles in parallel. This removes some tricky cases for the register transfer language (RTL). However, as a general rule this will increase the size of the encoder 20 because there is more storage required: an initial memory pool to absorb PCIe input; additional copies of the filter delay lines; more output buffers. The CPU 16 gets a very low latency service. Encoding is complete shortly after the tile transfer has completed (perhaps little more than ¼ of a tile transfer time). The CPU 16 only needs 2 to 3 tiles in flight in order to keep the encoder 20 fully busy.
If the output of the compression of the tile 12 is still very big, because this tile 12 has a lot of detail, then the CPU 16 has the option to re-compress with stronger quantization. An improvement to the encoder 20 is to perform the transform, sum the unquantized coefficients, and choose the quantization level in an intelligent manner. The cost of this is that the entire tile (about 16KB) must be stored after the tile transform; and latency seen by the CPU 16 is increased.
The encoder 20 can use an input interleave. On input it is possible to load several (say four) horizontally adjacent tiles 12 at a time. The advantage of this is that PCIe transfers are 1024 bytes (256 pixels) rather than 256 bytes (64 pixels, one line of a single tile), so that use of the PCIe bus is more efficient. A store performs "de-interleave", which is analogous to the interleaving of Reed-Solomon (RS) codewords in a modem. To de-interleave four tiles only needs a store that can contain exactly four tiles (for example 4×12KB) with a total bandwidth of 2× the PCIe arrival rate. The store can be organised with its own internal interleaving to get the required bandwidth with single-port RAMs. This is easiest if input and output are synchronous. This store can be organised so that any number of tiles up to the maximum can be loaded. This is carried out before the colour transform because the colour transform causes some bit growth.
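As a minimal sketch of the de-interleave store's addressing (assuming one scan line per burst and the 64-pixel, 4-bytes/pixel line size implied above; the function name is invented), each burst is simply sliced into per-tile segments:

```python
def deinterleave(bursts, n_tiles=4, line_bytes=256):
    # Each incoming burst holds one scan line spanning n_tiles
    # horizontally adjacent tiles; split it into per-tile buffers.
    tiles = [bytearray() for _ in range(n_tiles)]
    for burst in bursts:                      # one 1024-byte burst
        for t in range(n_tiles):
            tiles[t].extend(burst[t * line_bytes:(t + 1) * line_bytes])
    return tiles
```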
Figures 3 and 4 show more detail of the encoder 20. A PCI interface 22 connects to the CPU 16 over a PCI bus. The interface logic 22 connects to a set of control registers 24. Downstream of the interface 22 is a colour transform unit 26 and the tile transform logic 28. The tile transform unit 28 outputs to a tile store 30. The tile store 30 connects to a quantization stage 32 and downstream of the quantization stage 32 is an entropy encoder 34, which connects to an output store 36.
In relation to the Discrete Wavelet Transform, the diagram is scaled by the factor n, the number of PCIe lanes. Desirable values for n are 4, 8 and 16. This part of the device processes the input data at "line rate", i.e. however fast it arrives over PCIe. A tile is completed and stored in the tile store 30 in little more than the PCIe transfer time. In relation to the entropy coder 34, shown in Figure 4, the second part of the encoding process goes more slowly because in large configurations the entropy coding cannot keep up with the rate of data arrival. This is scaled by the factor n2, the number of steps/cycle of RLGR entropy coding that can be achieved. Likely values for n2 are 1, 2, 4 or 8, depending on clock speed and technology.
In relation to the tile store 30, the number of "tiles in flight" needed to keep the engine fully busy is somewhere between n/n2 and 2xn/n2, depending on how often the CPU 16 checks whether a tile has been completed.
It would be possible to perform the quantization before the tile store 30. The value of performing it afterwards is that it allows the hardware to determine a quantization level, based on some statistical measure of the values that enter the tile store. If there are many large values (meaning that the current tile is hard to compress) then it may be desirable to increase the quantization level, so that the result is usefully compact. Quantization of the DC values (the top-left 8x8 values, the LL3 sub-band) is done before delta coding.
The entropy coder output must be stored again in an output store 36, as shown in Figure 4, because at this point its size is not known, so for all but the first channel the encoder 20 does not know where to put it. The 12KB value suggested above means "tile not compressed at all" and is assumed to be a worst case. Some upper bound must be chosen, above which the compression has failed (i.e. the tile must be recompressed with greater quantization). The tile store 30 and the output store 36 could be the same memory system or could be separate, depending on bandwidth requirements. The best structure (and interleave organisation etc.) will vary depending on n and n2.
The complete tile is copied over PCIe to the intended destination address in host memory. To make best use of the interface 22 this should drive all n PCIe lanes at full rate. When this happens there is no input traffic: tile input (over PCIe) and tile output (over PCIe) are never active simultaneously. This is an argument for having the tile store 30 and the output store 36 as the same memory.
The three two-dimensional DWT filters have identical logic within certain parameters. The bits/value might not be the same for each channel and/or level (e.g. Y might need more than Cr, Cb). The values/cycle are scaled to fit the required throughput within a tile, and the vertical spacing is 64/32/16 depending on level. No multipliers are needed because the coefficients are all simple integers. Figure 5 illustrates the mechanics of the tile transform process carried out by the tile transform logic 28. The individual colour components of the tile 12 are each processed three times with a DWT. Each pass of the DWT is carried out firstly vertically and then horizontally through the tile.
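As a sketch of one such pass, a Haar kernel is assumed here, being the simplest of the transforms named in the claims; the actual filter taps used by the tile transform logic 28 are not reproduced.

```python
def haar_1d(x):
    # One level of a Haar-style integer transform: sums and
    # differences only, so no multipliers are needed.
    half = len(x) // 2
    lo = [(x[2 * i] + x[2 * i + 1]) >> 1 for i in range(half)]  # averages
    hi = [x[2 * i] - x[2 * i + 1] for i in range(half)]         # details
    return lo + hi

def dwt_pass(tile):
    # One 2-D pass over a square tile: vertically, then horizontally.
    n = len(tile)
    cols = [haar_1d([tile[r][c] for r in range(n)]) for c in range(n)]
    rows = [[cols[c][r] for c in range(n)] for r in range(n)]
    return [haar_1d(row) for row in rows]
```

Applying the pass three times, each time only to the low-frequency (top-left) quadrant of the previous result, yields the ten sub-bands referred to later.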
Logically the CPU 16 must perform two operations for each tile. Firstly, the CPU 16 must start a compress operation. The CPU 16 must tell the GPU 18 to write to encoder 20 (or tell the encoder 20 to read from the GPU 18). There are a few parameters such as quantization level and RLGR1/3 selection. An output tile ID (= index in tile store and output store) should also be selected. The second operation is that once compression is complete, data in the output store must be identifiable by the tile ID quoted in the first operation. The CPU 16 has to check pass/fail, read output size, decide where the data is required and tell the encoder 20 to write to required destination. The intended output address could be specified at the start and this reduces the CPU involvement in each tile, but only works if the address of each does not rely on the compressed size of the previous tile.
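The two-operation flow can be summarised in pseudo-driver form. The register names, encodings, status bits and device handle below are entirely hypothetical; the text specifies only the two operations and their parameters.

```python
REG_START, REG_STATUS, REG_SIZE, REG_WRITE = 0x00, 0x04, 0x08, 0x0C
DONE, FAILED = 0x1, 0x2          # hypothetical status bits

def compress_tile(dev, tile_id, quant_level, rlgr_mode, dest_addr):
    # Operation 1: start the compress, quoting an output tile ID.
    dev.write(REG_START, (tile_id, quant_level, rlgr_mode))
    while not dev.read(REG_STATUS, tile_id) & DONE:
        pass                      # in practice: interrupt or polling loop
    if dev.read(REG_STATUS, tile_id) & FAILED:
        return None               # recompress with stronger quantization
    # Operation 2: read the size, decide where it goes, write it out.
    size = dev.read(REG_SIZE, tile_id)
    dev.write(REG_WRITE, (tile_id, dest_addr))
    return size
```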
A preferred embodiment is for the CPU 16 to provide a scatter list of output (location, size) entries. Each entry is tagged with whether a tile can be split over the end of the entry. This allows a set of network buffers to be described, where each buffer might be split up due to logical/physical translation. The encoder 20 has to indicate somehow which entries have been used. Within a buffer, enough space has to be allowed for a TS_RFX_TILE block header. This is 19 bytes including (x, y) coordinates, length fields and quantization table fields, followed by the bit-packed data.
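A scatter entry might be modelled as follows. The field names and the greedy placement rule are assumptions; only the (location, size) pair, the split tag and the 19-byte header figure come from the description.

```python
from dataclasses import dataclass

@dataclass
class ScatterEntry:
    address: int        # host memory location
    size: int           # bytes available at that location
    allow_split: bool   # may a tile straddle the end of this entry?
    used: int = 0       # bytes consumed, reported back to the CPU 16

TILE_HEADER_BYTES = 19  # TS_RFX_TILE block header

def place_tile(entries, payload_bytes):
    # Greedy placement sketch: find an entry with room for the header
    # plus payload, or one that allows the tile to be split across
    # entry boundaries (the remainder continuing in the next entry).
    need = TILE_HEADER_BYTES + payload_bytes
    for entry in entries:
        free = entry.size - entry.used
        if free >= need or (entry.allow_split and free >= TILE_HEADER_BYTES):
            taken = min(free, need)
            entry.used += taken
            return entry.address, taken
    return None
```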
When under high system load, higher compression factors have to be used. Assuming a working average compression factor of six, this leads to a four bits/pixel output. One full high-definition screen at 30Hz = 60M pixels/second = 240Mbps. A Gigabit Ethernet port is therefore saturated by only a few such screens, so when all users are watching movies the encoder 20 has to compress by far more than a factor of six. Two bits/pixel at 5 frames per second is probably near the low end before users start to notice degradation in quality. Reducing the refresh rate saves work all round, and reducing the quality increases the asymmetry between encoder input and encoder output. Therefore content-based quantization selection within the encoder 20 is a very useful way to provide compression flexibility: the encoder 20 will compress by the "right" amount far more often than when the CPU 16 has to guess the quantization level, and re-encoding a tile is an expensive operation.
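The arithmetic behind these figures, reproduced as a quick check (a 1920x1080 screen geometry is assumed; the text rounds the pixel rate to 60M/second):

```python
PIXELS_PER_SECOND = 1920 * 1080 * 30     # one full-HD screen at 30Hz, ~62M

for bits_per_pixel in (4, 2):
    mbps = PIXELS_PER_SECOND * bits_per_pixel / 1e6
    screens = int(1000 // mbps)          # per Gigabit Ethernet port
    print(f"{bits_per_pixel} bits/pixel: {mbps:.0f} Mbps, "
          f"{screens} screens per GigE port")
```

At four bits/pixel this gives roughly 249 Mbps, i.e. only about four such screens per gigabit port, consistent with the saturation point made above.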
High performance PCIe throughput is essential for a high performance encoder product. Each PCIe lane can support uncompressed data for two full high-definition 30Hz updating screens, as a theoretical maximum. So, a 16*PCIe GPU and a 16*PCIe Encoder, with everything else perfect, cannot achieve more than 32 such screens. The current structure of GPU interfacing in Windows 7 does not allow movement of Windows 7 screen data direct between the GPU 18 and the encoder 20. At the very least the data would move twice over the PCIe bus: once moved by a DirectXIO primitive which copies a texture from GPU 18 to system memory, and then again as the encoder 20 issues PCIe reads to that texture.
If the GPU 18 has 16×PCIe and the CPU 16 does a good job of this copy, the theoretical maximum bandwidth is still reduced; in particular, a high performance encoder with 16×PCIe lanes would likely be halved in its throughput. The known path is therefore for the encoder 20 to perform PCIe READs from texture memory in store. It would be desirable for the encoder 20 to accept PCIe WRITE operations containing the pixel data.
The encoder 20 provides content-sensitive quantization. The value of doing quantization after the tile store is that it allows the hardware to determine a quantization level, based on a statistical measure of the coefficient values that enter the tile store, from a set of predetermined quantization levels. If there are many large values (which implies that this tile is hard to compress) then it is desirable to increase the quantization level, so that the result is usefully compact.
For the CPU 16, the high level decision is to compress to available bandwidth. Based on a gross measure of current activity, each tile is given a compressed size target at the start of the compression operation. The coefficients exist in ten sub-bands, ranging from low frequency (most important) to high frequency (can be quantized most) data. The encode process can quantize each sub-band separately, so that each tile has ten quantization values sent with it.
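Sketching the per-sub-band quantization step: shift-based quantization is an assumption here, consistent with the "quantization shift value" mentioned below, and the sub-band names follow the usual three-level DWT labelling.

```python
# Ten sub-bands from a three-level DWT, low to high frequency.
SUB_BANDS = ["LL3", "HL3", "LH3", "HH3",
             "HL2", "LH2", "HH2",
             "HL1", "LH1", "HH1"]

def quantize_subband(coeffs, shift):
    # Quantize by arithmetic right-shift towards zero.
    return [c >> shift if c >= 0 else -((-c) >> shift) for c in coeffs]

def quantize_tile(bands, shifts):
    # bands and shifts are dicts keyed by sub-band name; the ten
    # shift values travel with the encoded tile.
    return {name: quantize_subband(bands[name], shifts[name])
            for name in SUB_BANDS}
```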
As the transform coefficients are stored, the encoder 20 collects statistics about them. The most desirable statistic to collect is the number of significant bits for a range of quality settings, all in parallel. When the tile is ready for entropy coding, the encoder 20 picks the quality setting where the number of significant bits does not exceed the desired encoded tile size. The entropy coder will not result in precisely this many bits but on average it will be close enough to be useful.
A simple area of the screen will not require much space even when coded at maximum possible quality. So, there is little waste, and complexity in one area of the screen will not compromise quality in another. There are other strategies which the CPU 16 can use to balance quality in different areas, or to mend tiles which were sent at low quality and where there is now spare bandwidth available.
There is a further adaptation that can be used when the computer 14 is under heavy load. The GPU 18 subsamples the entire screen in its store. The subsampled screen is then passed to the encoder 20 in 32x32 tiles, reducing the initial transfer time by ¾. The encoder 20 operates to encode these as if the high-frequency coefficients are all zeros. Alternatively, the transfer happens as a 64x64 block, but the encoder 20 then produces 4 encoded tiles (a 2x2 square). This reduces PCIe load a great deal for cases where the encoder 20 is going to quantize the high frequency coefficients out of existence anyway.
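A 2:1 subsample in each dimension might look as follows; a 2x2 box average is assumed, as the text does not specify the GPU's subsampling kernel.

```python
def subsample_half(screen):
    # Average each 2x2 block, halving both dimensions; a 64x64 block
    # of source pixels becomes a 32x32 tile.
    return [[(screen[y][x] + screen[y][x + 1] +
              screen[y + 1][x] + screen[y + 1][x + 1]) >> 2
             for x in range(0, len(screen[0]), 2)]
            for y in range(0, len(screen), 2)]
```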
The fundamental part of the compression process is shown in Figure 6. The method of compressing the frame of pixel tiles comprises, firstly, step S6.1, which comprises receiving the frame of pixel tiles, and secondly, step S6.2, which comprises determining a bandwidth available for each pixel tile; steps S6.3, S6.4 and S6.5 are then repeated for each colour channel of each pixel tile. This method is executed by the encoder 20, which receives the frame of pixels from either the CPU 16 or the GPU 18, with the information about the bandwidth being supplied by the CPU 16. Preferably, the available bandwidth for each pixel tile is expressed as a bit rate per pixel.
Step S6.3 comprises performing a transform of the pixel data to create a series of coefficients, step S6.4 comprises selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and step S6.5 comprises performing a quantization of the series of coefficients using the selected quantization level. In this way, the tiles of the frame are compressed using a quantization level that is selected intelligently, in real-time, as the compression is carried out. Essentially a desired size for the post-quantization data is known and the quantization level is selected to achieve that desired size, taking into account the coefficients that have resulted from the transform step S6.3.
Step S6.4 uses estimates of the final entropy-coded size mapped to a range of different quantization settings. The estimates need not be exact in order to be useful, as meeting an approximate target for the encoded size of the tile is sufficient to meet design objectives concerning bandwidth management. The tile is quantized using a range of quantization values, with different values for each channel and each sub-band. A number of estimation methods are possible, ranging in complexity.
In one simple embodiment, the quantization settings for the tile are approximated using a single "quality" metric which is expressed as an expected bit/pixel value. These are determined by exhaustive search over a chosen set of reference images. A small finite set of quality metric settings is chosen, giving "minimum compression", "maximum compression", and a range of values between. Eight values would be sufficient. Each quality setting provides a quantization level for each sub-band.
For each quality metric setting, a statistic is gathered over the coefficients. In a hardware implementation these can be gathered in parallel. The statistic is the sum, over all the coefficients, of (the number of significant bits in the coefficient minus the quantization shift value for this quality metric and this sub-band), taking 1 wherever this difference is <= 0. Having gathered this statistic, the encoder selects the quality level where the statistic most closely meets the desired target tile size in bits. The statistic can be improved by giving special consideration to zero coefficients, since the entropy coder in use can code runs of zeros in less than one bit. When a run of zeros appears within a sub-band, after a fixed number of zeros their size is taken to be 0 rather than 1. Making this change after six consecutive zeros gives reasonable results, but the optimum will depend on the precise entropy coding system in use.
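Putting the statistic and the selection together, a minimal sketch follows, assuming the dict-based sub-band layout of the earlier quantization sketch; quality_tables maps each quality setting to its per-sub-band shift values.

```python
def estimate_bits(bands, shifts, zero_run=6):
    # Sum of (significant bits - shift), floored at 1, with zeros
    # counted as free after zero_run consecutive zeros in a sub-band.
    total = 0
    for name, coeffs in bands.items():
        run = 0
        for c in coeffs:
            run = run + 1 if c == 0 else 0
            if c == 0 and run > zero_run:
                continue                  # long zero runs cost ~0 bits
            total += max(1, abs(c).bit_length() - shifts[name])
    return total

def pick_quality(bands, quality_tables, target_bits):
    # Choose the setting whose estimate most closely meets the target;
    # in hardware, all the estimates are gathered in parallel as the
    # coefficients are stored.
    return min(quality_tables,
               key=lambda q: abs(estimate_bits(bands, q) - target_bits))
```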

Claims

1. A method of compressing a frame of pixel tiles comprising:
receiving a frame of pixel tiles,
determining a bandwidth available for each pixel tile, and for each colour channel of each pixel tile:
performing a transform of the pixel data to create a series of coefficients,
selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and
performing a quantization of the series of coefficients using the selected quantization level.
2. A method according to claim 1, wherein the available bandwidth for each pixel tile is expressed as a bit rate per pixel.
3. A method according to claim 1 or 2, and further comprising varying the bandwidth available for different tiles of the same frame of pixel tiles.
4. A method according to claim 1, 2 or 3, wherein the transform comprises a Haar transform or a Discrete Wavelet Transform.
5. A method according to any preceding claim, and further comprising performing entropy coding of the quantized series of coefficients.
6. A method according to any preceding claim, wherein the step of selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of coefficients comprises mapping the coefficient size to one of a plurality of predetermined levels and using a quantization level pre-assigned to the mapped level.
7. A device for compressing a frame of pixel tiles comprising an encoder arranged to:
receive a frame of pixel tiles,
determine a bandwidth available for each pixel tile, and
for each colour channel of each pixel tile:
perform a transform of the pixel data to create a series of coefficients,
select a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and
perform a quantization of the series of coefficients using the selected quantization level.
8. A device according to claim 7, wherein the available bandwidth for each pixel tile is expressed as a bit rate per pixel.
9. A device according to claim 7 or 8, wherein the encoder is further arranged to vary the bandwidth available for different tiles of the same frame of pixel tiles.
10. A device according to claim 7, 8 or 9, wherein the transform comprises a Haar transform or a Discrete Wavelet Transform.
11. A device according to any one of claims 7 to 10, wherein the encoder is further arranged to perform entropy coding of the quantized series of coefficients.
12. A device according to any one of claims 7 to 11, wherein the encoder is arranged, when selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of coefficients, to map the coefficient size to one of a plurality of predetermined stages and to use a quantization level pre-assigned to the mapped stage.