WO2012066292A1 - Video compression - Google Patents

Video compression

Info

Publication number
WO2012066292A1
WO2012066292A1 (PCT/GB2011/001619)
Authority
WO
WIPO (PCT)
Prior art keywords
pixel
tile
quantization
coefficients
transform
Prior art date
Application number
PCT/GB2011/001619
Other languages
French (fr)
Inventor
William Stoye
Original Assignee
Displaylink (Uk) Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Displaylink (Uk) Limited filed Critical Displaylink (Uk) Limited
Priority to EP11801804.3A priority Critical patent/EP2641399A1/en
Publication of WO2012066292A1 publication Critical patent/WO2012066292A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/18Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a set of transform coefficients
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • H04N19/126Details of normalisation or weighting functions, e.g. normalisation matrices or variable uniform quantisers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • H04N19/14Coding unit complexity, e.g. amount of activity or edge presence estimation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146Data rate or code amount at the encoder output
    • H04N19/149Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/63Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using sub-band based transform, e.g. wavelets

Definitions

  • This invention relates to a method of compressing a frame of pixel tiles.
  • The compression of video data is a large and wide-ranging technical field.
  • As display devices such as televisions and computer monitors have increased in size and resolution, and the number of sources of video has increased through the expansion of television channels and Internet sites, the importance of saving bandwidth by compressing video has correspondingly increased.
  • Well-known technologies such as JPEG and MPEG provide compression technologies that are in extensive use throughout various different industries, particularly television broadcast and computing. These compression technologies operate on the principle that there are large temporal and spatial redundancies within video images that can be exploited to remove significant amounts of information without degrading the quality of the end user's experience of the resulting image.
  • a colour image may have twenty-four bits of information per pixel, being eight bits each for three colour channels of red, green and blue.
  • this information can be reduced to two bits per pixel without the quality of the final image overly suffering.
  • This can be achieved by dividing the image into rectangular blocks (or tiles), where each block is then subjected to a mathematical transform (such as the Discrete Cosine Transform) to produce a series of coefficients. These coefficients are then quantized (effectively divided by predetermined numbers) and the resulting compressed image data can be transmitted.
  • the data is decompressed by performing reverse quantization and reversing the chosen transform to reconstruct the original block.
  • entropy encoding to further reduce the amount of data that is actually transmitted.
  • Compression technologies that are based around the principle of transforming tiles and then quantizing the resulting coefficients are highly effective at reducing the amount of video data that then has to be transmitted. However, they are not necessarily as flexible as is desirable in the specific situation being used. It is known that certain types of images compress much better than others and techniques that are appropriate for photographic type images, such as conventional broadcast television, do not work as well with desktop type images produced by business computers, and vice versa. When bandwidth is restricted and different types of images need to be compressed a highly flexible approach to the compression of the image data is desirable.
  • United States Patent 5,629,780 describes a system and method for image data compression.
  • This Patent describes a method for performing colour or grayscale image compression that eliminates redundant and invisible image components.
  • the image compression uses a Discrete Cosine Transform (DCT) and each DCT coefficient yielded by the transform is quantized by an entry in a quantization matrix which determines the perceived image quality and the bit rate of the image being compressed.
  • the method adapts or customizes the quantization matrix to the image being compressed.
  • This method has a number of disadvantages, the main two being that firstly the customised quantization matrix must be generated in real-time in an iterative process, which uses up both time and processing resources, and secondly that the customised quantization matrix must be transmitted to the decompression end of the process, which uses up bandwidth. It is therefore an object of the invention to improve upon the known art.
  • a method of compressing a frame of pixel tiles comprising receiving a frame of pixel tiles, determining a bandwidth available for each pixel tile, and for each colour channel of each pixel tile performing a transform of the pixel data to create a series of coefficients, selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and performing a quantization of the series of coefficients using the selected quantization level.
  • a device for compressing a frame of pixel tiles comprising an encoder arranged to receive a frame of pixel tiles, determine a bandwidth available for each pixel tile, and for each colour channel of each pixel tile perform a transform of the pixel data to create a series of coefficients, select a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and perform a quantization of the series of coefficients using the selected quantization level.
  • the quantization level is selected from a set of predetermined quantization levels based on the distribution of the transform coefficient values in order to meet an approximate target compressed image size.
  • the quantization level for each video tile must be provided for the quantization stage of compression.
  • the quantization level corresponds approximately to an image quality level, and straightforward execution of the compression algorithm would require that the desired quality level is an input to the compression process.
  • the compressed image size may vary wildly depending on the input image.
  • By selecting the quantization level from a set of predetermined quantization levels using a function of the determined available bandwidth and the size of the coefficients generated during the transform step, there is no need to generate a new quantization matrix, which would waste time and processing resources, as in the prior art US Patent referred to above.
  • the invention also differs from the disclosure of this Patent in that the quantization levels or matrix used in the quantization step does not need to be transmitted to the decompression end of the process, thereby saving bandwidth, as the set of predetermined quantization levels can be present at the decompression end of the process.
  • the step of selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of coefficients comprises mapping the coefficient size to one of a plurality of predetermined stages and using a quantization level pre-assigned to the mapped stage.
  • their size is determined, which may be an absolute measure or may be based on the number of significant bits that are present, for example. This size is then used to select a quantization level.
  • Predetermined stages that are specific to the determined bandwidth can be used to map the coefficient size to a specific stage which has a quantization level pre-assigned to the specific stage. In this way the quantization level is chosen. Effectively a matrix of predetermined quantization levels is used, which could be 8x8, with eight different bandwidths on one axis and eight different coefficient size ranges on the other axis.
  • the quantization level for a video tile is determined after the transform stage of compression and before quantization.
  • the input to the transform for a 64x64 video tile is 4096 pixel values, each expressed as three colour components (Y, Cr, Cb), a total of 12288 values.
  • the relevant statistics can be collected in parallel with the transform stage of compression, as each transform result is produced.
  • Figure 1 is a schematic diagram showing the processing of a video frame
  • Figure 2 is a schematic diagram of components of a computer
  • Figures 3 and 4 are schematic diagrams of components of an encoder
  • Figure 5 is a schematic diagram illustrating the transform of a tile of a video frame
  • Figure 6 is a flowchart of a method of compressing a video frame.
  • Figure 1 shows a frame 10 that is comprised of tiles 12.
  • the frame is compressed through the process steps S1 colour transform, S2 tile transform, S3 quantization and S4 entropy coding.
  • This type of compression is used for example when a server is running virtual machines or client sessions for remote client devices. For example, a single server may have twenty remote clients connected, and the server must provide twenty images, each image representing the image of the client session that must be displayed at the respective client device.
  • To compress a frame of video data expressed in conventional RGB format it is desirable to perform a colour transform to the Y, Cb, Cr domain.
  • There is then performed a tile transform using, for example, a Haar transform or a 5/3 Discrete Wavelet Transform (DWT).
  • the Haar transform converts an n × n tile into low frequency data, a (n/2) × (n/2) tile, where each pixel is the average of four input pixels, and high frequency data, n² − (n/2)² additional values showing local pixel deltas from the low frequency values. This can be repeated multiple times on a large square, re-visiting the low frequency data at each iteration of the process.
  • the DWT transform performs exactly the same process but it uses low frequency data, a 5-tap-sinc-like-FIR-filtered sub-sampled (n/2) × (n/2) tile, taps (-1 2 6 2 -1), and high frequency data, n² − (n/2)² additional values showing local pixel deltas from the average of two or four adjacent pixels.
  • the DWT will typically produce far more tiny numbers than the Haar transform.
  • the Haar transform will usually provide a better result, because a sharp edge between two fixed values will create several non-zero coefficients with DWT.
  • the compression uses a tile size of 64 × 64 pixels.
  • the transform turns a tile into a series of coefficients and these are then quantized and entropy coded.
  • the compression uses a scheme called Run Length Golomb Rice. This has no tables but uses two adaptive parameters to scale the coded values depending on the distribution of input values.
  • the DWT transform does not break down into neat 2 x 2 groups like the Haar transform but is performed in strips (horizontally and vertically) on a large tile. This is explained in more detail below with reference to Figure 5.
  • FIG. 2 illustrates schematically some of the components of a server
  • the server 14 which has a central processing unit (CPU) 16, a graphics processing unit (GPU) 18 and a PCI encoder 20.
  • PCI encoder 20 is the principal component for compressing the multiple video streams received from the CPU 16.
  • the hardware encoder 20 takes un-coded video tiles as its input and produces coded data messages as its output. It is important for the encoder 20 to perform all of the steps (colour transform, tile transform, quantization and entropy coding) because otherwise the IO bandwidth is increased.
  • the encoder 20 does not need to hold an entire screen image at one time. Because of this, the encoder 20 can be built on a field-programmable gate array (FPGA) which does not need any external storage (DDR). None of the processing steps need vast storage, for example a 64 × 64 tile requires 12KB in RGB form. Holding more tiles would increase the system throughput, but it need not be huge.
  • an RLGR entropy encoder or decoder will be very small, because it has no tables.
  • Pixels to be encoded are delivered optimally over PCIe from the GPU 18 to the encoder 20, by the encoder 20 doing a PCI read or the GPU 18 doing a PCI write.
  • Each PCIe lane delivers 4Gb per second. If pixels are packed as 32 bits then this makes 125Mps (Mega pixels/second).
  • 125MHz is achievable and 250MHz is possible.
  • This means an n-lane PCIe interface will feed pixels at n pixels/clock (125MHz) or n/2 pixels/clock (250MHz) into the encoder. Desirable values of n are 4, 8 and 16. There is value in keeping up with the bulk arrival rate of pixels in the encoder 20, otherwise there has to be added a storage buffer to the start of the encoder 20, which adds little value and increases encode latency.
  • the encode stages are as follows. Firstly, there is a colour transform, which can be done in parallel on as many pixels as required. Inputs are 8 bits, output to be determined but perhaps 10 to 11 bits. It is possible that the Y colour channel will require more bits than Cr or Cb. If necessary, it is possible to add clip logic to control this; the whole protocol is slightly lossy, so this is acceptable if done carefully. From here on the three colour channels are split apart and dealt with in parallel.
  • the next stage is the tile transform.
  • the entire transform consists of what amounts to straightforward 3-tap and 5-tap FIR filters so it is possible to achieve whatever level of parallel operation is required.
  • the coefficients are fixed and are simple integer values, no multiplies required.
  • the encoder 20 includes a stage 1 vertical filter, which must keep up with the arriving pixel rate; a 5 × 64 pixel delay line is required. Assuming 30 bits/pixel this makes 9600 bits regardless of calculations/cycle. This is done as flip-flops because of the bandwidth required. Similarly, a stage 1 horizontal filter is needed, which must keep up with the arriving pixel rate, covering five adjacent pixels. At this point ¾ of the pixels go straight to the entropy coder; ¼ (32x32) require stage 2 wavelet transformation. These naturally arrive at half the overall pixel arrival rate, so it is possible to halve the width of any processing.
  • the encoder 20 also includes a stage 2 vertical filter, with a 5x32 pixel delay line required, using 4800 flip-flops, and a stage 2 horizontal filter covering 5 adjacent pixels. At this point another ¾ of the pixels go to the entropy coder; ¼ (16x16) require stage 3 wavelet transformation. These arrive at a quarter of the overall pixel arrival rate.
  • the stage 3 vertical filter, with a 5x16 pixel delay line required, has another 2400 flip-flops.
  • Quantization is done as a simple shift, followed by rounding. This should be done as soon as possible after the final filtering stage as it allows data path widths to be reduced.
  • the values must be fed to the entropy coder in the correct order.
  • Entropy coding is performed separately on each of the Y, Cr, Cb channels.
  • the entropy coder is adaptive. The number of bits used to encode a value depends on two parameters kP and kRP, and these adapt as each value is encoded, depending on the size of each passing value and the current values of the parameters. This means that it is not possible to do entropy coding in parallel on values in a stream, although notably each channel of the tile is a separate stream.
  • the speed of the encoder 20 is very important. It is simple to add, subtract and compare values, but if only one per cycle can be handled then the encoder 20 cannot output more than one pixel per cycle for a tile. Ideally the encoder 20 will perform 2 or 4 per cycle. On an FPGA, 4 per cycle at 125MHz is easier than 2 per cycle at 250MHz.
  • the entropy coding can be performed at the input pixel rate. For at least part of the tile processing, the values can arrive this fast. If they are processed slower than this then there is a need for more store internally; the encoder 20 can make the CPU 16 wait longer for each tile; and there may need to be more replications of the entropy coding circuit in order to keep the whole system busy, i.e. to allow the host to feed data into the encoder 20 at the full interface rate.
  • the result of entropy coding should be built up in three separate memories, as their final size is unknown. There will need to be multiple copies of these in order to keep the whole system busy.
  • a separate command from the CPU 16 will cause the encoder 20 to write the output packet to a chosen address over PCIe.
  • the output will be much smaller than the input; compression by ×5 to ×10 ought to be achievable.
  • the CPU 16 gives one command to encode a tile. When the encoding is complete it is told the size of the result. It then gives a separate command to write the output over PCI, as where to write it may be affected by its size, for example packing into transport frames may be needed. There will need to be several copies of the output buffer in order to allow the encoder 20 to stay busy while this happens.
  • the CPU 16 has the option to re-compress with stronger quantization.
  • An improvement to the encoder 20 is to perform the transform, sum the unquantized coefficients, and choose the quantization level in an intelligent manner. The cost of this is that the entire tile (about 16KB) must be stored after the tile transform; and latency seen by the CPU 16 is increased.
  • the encoder 20 can use an input interleave. On input it is possible to load several (say four) horizontally adjacent tiles 12 at a time. The advantage of this is that PCI transfers are 1024 bytes (256 pixels) rather than 256 bytes (for one tile), so that use of the PCI bus is more efficient.
  • a store performs "de- interleave" which is analogous to the interleaving of RS codewords in a modem. To de-interleave four tiles only needs a store that can contain exactly four tiles (for example 4* 12KB) with a total bandwidth 2*PCl arrival rate.
  • the store can be organised with its own internal interleaving to get the required bandwidth with single-port RAMs. This is easiest if input and output are synchronous. This store can be organised so that any number up to the maximum can be loaded. This is carried out before the colour transform because the colour transform causes some bit growth.
  • Figures 3 and 4 show more detail of the encoder 20.
  • a PCI interface 22 connects to the CPU 16 over a PCI bus.
  • the interface logic 22 connects to a set of control registers 24. Downstream of the interface 22 is a colour transform unit 26 and the tile transform logic 28.
  • the tile transform unit 28 outputs to a tile store 30.
  • the tile store 30 connects to a quantization stage 32 and downstream of the quantization stage 32 is an entropy encoder 34, which connects to an output store 36.
  • the diagram is scaled by the factor n, the number of PCIe lanes. Desirable values for n are 4, 8 and 16.
  • This part of the device processes the input data at "line rate", i.e. however fast it arrives over PCIe.
  • a tile is completed and stored in the tile store 30 in little more than the PCIe transfer time.
  • In relation to the entropy coder 34, shown in Figure 4, the second part of the encoding process goes more slowly because in large configurations the entropy coding cannot keep up with the rate of data arrival.
  • This part is scaled by the factor n2, the number of steps/cycle of RLGR entropy coding that can be achieved.
  • Likely values for n2 are 1, 2, 4 or 8, depending on clock speed and technology.
  • the number of "tiles in flight" needed to keep the engine fully busy is somewhere between n/n2 and 2xn/n2, depending on how often the CPU 16 checks whether a tile has been completed.
  • the entropy coder output must be stored again in an output store 36, as shown in Figure 4, because at this point its size is not known, so for all but the first channel the encoder 20 does not know where to put it.
  • the 12KB value suggested above means "tile not compressed at all" and is assumed to be a worst case. Some upper bound must be chosen, above which the compression has failed (i.e. the tile must be recompressed with greater quantization).
  • the tile store 30 and the output store 36 could be the same memory system or could be separate, depending on bandwidth requirements. The best structure (and interleave organisation etc.) will vary depending on n and n2.
  • the three two-dimensional DWT filters have identical logic within certain parameters.
  • the bits/value might not be the same for each channel and/or level (e.g. Y might need more than Cr, Cb).
  • Figure 5 illustrates the mechanics of the tile transform process carried out by the tile transform logic 28.
  • the individual colour components of the tile 12 are each processed three times with a DWT. Each pass of the DWT is carried out firstly vertically and then horizontally through the tile.
  • a preferred embodiment is for the CPU 16 to provide a scatter list of output (location, size) entries. Each entry is tagged with whether a tile can be split over the end of the entry. This allows a set of network buffers to be described, where each buffer might be split up due to logical/physical translation.
  • the encoder 20 has to indicate somehow which entries have been used. Within a buffer, enough space has to be allowed for a TS_RFX_TILE block header. This is 19 bytes including (x, y) coordinates, length fields and quantization table fields, followed by the bit-packed data.
  • High performance PCIe throughput is essential for a high performance encoder product.
  • Each PCIe lane can support uncompressed data for two full high-definition 30Hz updating screens, as a theoretical maximum. So, a ×16 PCIe GPU and a ×16 PCIe encoder, with everything else perfect, cannot achieve more than 32 such screens.
  • the current structure of GPU interfacing in Windows 7 does not allow movement of Windows 7 screen data direct between the GPU 18 and the encoder 20. At the very least the data would move twice over the PCIe bus: once moved by a DirectXIO primitive which copies a texture from GPU 18 to system memory, and then again as the encoder 20 issues PCIe reads to that texture.
  • the known path is for the encoder 20 to perform PCIe READs from texture memory in store. It is desirable that the encoder 20 would accept PCIe WRITE operations containing the pixel data.
  • the encoder 20 provides content-sensitive quantization.
  • the value of doing quantization after the tile store is that it allows the hardware to determine a quantization level, based on a statistical measure of the coefficient values that enter the tile store, from a set of predetermined quantization levels. If there are many large values (which implies that this tile is hard to compress) then it is desirable to increase the quantization level, so that the result is usefully compact.
  • the high level decision is to compress to available bandwidth. Based on a gross measure of current activity, each tile is given a compressed size target at the start of the compression operation.
  • the coefficients exist in ten sub-bands, ranging from low frequency (most important) to high frequency (can be quantized most) data.
  • the encode process can quantize each sub-band separately, so that each tile has ten quantization values sent with it.
  • the encoder 20 collects statistics about them.
  • the most desirable statistic to collect is the number of significant bits for a range of quality settings, all in parallel.
  • the encoder 20 picks the quality setting where the number of significant bits does not exceed the desired encoded tile size. The entropy coder will not result in precisely this many bits but on average it will be close enough to be useful.
  • a simple area of the screen will not require much space even when coded at maximum possible quality. So, there is little waste, and complexity in one area of the screen will not compromise quality in another. There are other strategies which the CPU 16 can use to balance quality in different areas, or to mend tiles which were sent at low quality and where there is now spare bandwidth available.
  • the GPU 18 subsamples the entire screen in store.
  • the subsampled screen is then passed to the encoder 20 in 32x32 tiles, reducing the initial transfer time by ¾.
  • the encoder 20 operates to encode these as if the high-frequency coefficients are all zeros. Or, the transfer happens 64x64 but the encoder 20 then appears to produce 4 encoded tiles (in a 2x2 square). This reduces PCIe load a great deal for cases where the encoder 20 is going to quantize the high frequency coefficients out of existence anyway.
  • the fundamental part of the compression process is shown in Figure 6.
  • the method of compressing the frame of pixel tiles comprises, firstly step S6.1, which comprises receiving the frame of pixel tiles, secondly step S6.2 which comprises determining a bandwidth available for each pixel tile, and then for each colour channel of each pixel tile steps S6.3, S6.4 and S6.5 are repeated.
  • This method is executed by the encoder 20, receiving the frame of pixels from either the CPU 16 or the GPU 18, with the information about the bandwidth being supplied by the CPU 16.
  • the available bandwidth for each pixel tile is expressed as a bit rate per pixel.
  • Step S6.3 comprises performing a transform of the pixel data to create a series of coefficients
  • step S6.4 comprises selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients
  • step S6.5 comprises performing a quantization of the series of coefficients using the selected quantization level.
  • the tiles of the frame are compressed using a quantization level that is selected intelligently, in real-time, as the compression is carried out.
  • a desired size for the post-quantization data is known and the quantization level is selected to achieve that desired size, taking into account the coefficients that have resulted from the transform step S6.3.
  • Step S6.4 uses estimates of the final entropy-coded size mapped to a range of different quantization settings.
  • the estimates need not be exact in order to be useful, as meeting an approximate target for the encoded size of the tile is sufficient to meet design objectives concerning bandwidth management.
  • the tile is quantized using a range of quantization values, with different values for each channel and each sub-band. A number of estimation methods are possible, ranging in complexity.
  • the quantization settings for the tile are approximated using a single "quality" metric which is expressed as an expected bit/pixel value. These are determined by exhaustive search over a chosen set of reference images. A small finite set of quality metric settings is chosen, giving “minimum compression”, “maximum compression”, and a range of values between. Eight values would be sufficient. Each quality setting provides a quantization level for each sub-band.
  • a statistic is gathered over the coefficients. In a hardware implementation these can be done in parallel.
  • the statistic can be improved by giving special consideration to zero coefficients, where the entropy coder in use can code runs of zeros in less than one bit. When a run of zeros appears within a sub-band, after a fixed number of zeros their size is taken to be 0 rather than 1. Making this change after six consecutive zeros gives reasonable results, but the optimum will depend on the precise entropy coding system in use. A sketch combining this statistic with the quality-setting selection appears after this list.
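The following Python sketch (placed here, after the list, as referenced above) combines the significant-bits statistic with the quality-setting selection. The eight quality settings, their per-sub-band shifts and the example coefficients are invented placeholders; only the shape of the scheme comes from the items above.

```python
# Minimal sketch: for each of eight hypothetical quality settings, sum the
# significant bits of the quantized coefficients (zeros become free after
# six in a row) and pick the weakest quantization that fits the target.
NSUB = 10                                    # sub-bands, low to high frequency

# Hypothetical shift table: higher quality index and higher sub-bands
# quantize more strongly. Real tables would come from the exhaustive
# search over reference images mentioned above.
QUALITY_SHIFTS = [[min(q + s // 3, 12) for s in range(NSUB)] for q in range(8)]

def estimated_bits(subbands, shifts, zero_run_limit=6):
    """Sum of significant bits after quantization; zeros cost one bit each
    until zero_run_limit consecutive zeros, after which they cost nothing."""
    total = 0
    for band, shift in zip(subbands, shifts):
        run = 0
        for c in band:
            q = abs(c) >> shift
            if q == 0:
                run += 1
                total += 0 if run > zero_run_limit else 1
            else:
                run = 0
                total += q.bit_length()
    return total

def pick_quality(subbands, target_bits):
    """Highest quality (weakest quantization) whose estimate fits the target;
    in hardware all eight estimates could be gathered in parallel."""
    for q, shifts in enumerate(QUALITY_SHIFTS):
        if estimated_bits(subbands, shifts) <= target_bits:
            return q, shifts
    return len(QUALITY_SHIFTS) - 1, QUALITY_SHIFTS[-1]

bands = [[40, 0, 0, 0, -3, 12, 0, 0] for _ in range(NSUB)]
print(pick_quality(bands, target_bits=150))
```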

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method of compressing a frame of pixel tiles comprises receiving a frame of pixel tiles, determining a bandwidth available for each pixel tile, and for each colour channel of each pixel tile performing a transform of the pixel data to create a series of coefficients, selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and performing a quantization of the series of coefficients using the selected quantization level. In a preferred embodiment, the selection of the quantization level comprises mapping the coefficient size to one of a plurality of predetermined stages and using a quantization level pre-assigned to the mapped stage.

Description

DESCRIPTION
VIDEO COMPRESSION
This invention relates to a method of compressing a frame of pixel tiles.
The compression of video data is a large and wide-ranging technical field. In general, as display devices such as televisions and computer monitors have increased in size and resolution and the number of sources of video has increased through the expansion of television channels and Internet sites, the importance of saving bandwidth by compressing video has correspondingly increased. Well-known technologies such as JPEG and MPEG provide compression technologies that are in extensive use throughout various industries, particularly television broadcast and computing. These compression technologies operate on the principle that there are large temporal and spatial redundancies within video images that can be exploited to remove significant amounts of information without degrading the quality of the end user's experience of the resulting image.
For example, a colour image may have twenty-four bits of information per pixel, being eight bits each for three colour channels of red, green and blue. Using conventional compression techniques, this information can be reduced to two bits per pixel without the quality of the final image overly suffering. This can be achieved by dividing the image into rectangular blocks (or tiles), where each block is then subjected to a mathematical transform (such as the Discrete Cosine Transform) to produce a series of coefficients. These coefficients are then quantized (effectively divided by predetermined numbers) and the resulting compressed image data can be transmitted. At the receiving end, the data is decompressed by performing reverse quantization and reversing the chosen transform to reconstruct the original block. Other steps may also occur in the process, such as entropy encoding, to further reduce the amount of data that is actually transmitted. Compression technologies that are based around the principle of transforming tiles and then quantizing the resulting coefficients are highly effective at reducing the amount of video data that then has to be transmitted. However, they are not necessarily as flexible as is desirable in the specific situation being used. It is known that certain types of images compress much better than others and techniques that are appropriate for photographic type images, such as conventional broadcast television, do not work as well with desktop type images produced by business computers, and vice versa. When bandwidth is restricted and different types of images need to be compressed a highly flexible approach to the compression of the image data is desirable.
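To make that round trip concrete, the following Python sketch transforms one block, quantizes the coefficients, and reconstructs it. The 8 × 8 block size, the single quantization step and the DCT-II basis are illustrative choices, not details taken from this document.

```python
import numpy as np

N = 8  # illustrative block edge; the tiles discussed below are 64x64

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)        # DC row scaling for orthonormality
    return c

C = dct_matrix(N)

def compress_block(block: np.ndarray, q_step: float) -> np.ndarray:
    coeffs = C @ block @ C.T              # 2-D transform
    return np.round(coeffs / q_step)      # quantization: divide and round

def decompress_block(q: np.ndarray, q_step: float) -> np.ndarray:
    return C.T @ (q * q_step) @ C         # reverse quantization + transform

block = np.random.randint(0, 256, (N, N)).astype(float)
restored = decompress_block(compress_block(block, 16.0), 16.0)
print(float(np.abs(block - restored).max()))  # error bounded by the step
```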
United States Patent 5,629,780 describes a system and method for image data compression. This Patent describes a method for performing colour or grayscale image compression that eliminates redundant and invisible image components. The image compression uses a Discrete Cosine Transform (DCT) and each DCT coefficient yielded by the transform is quantized by an entry in a quantization matrix which determines the perceived image quality and the bit rate of the image being compressed. The method adapts or customizes the quantization matrix to the image being compressed. This method has a number of disadvantages, the main two being that firstly the customised quantization matrix must be generated in real-time in an iterative process, which uses up both time and processing resources, and secondly that the customised quantization matrix must be transmitted to the decompression end of the process, which uses up bandwidth. It is therefore an object of the invention to improve upon the known art.
According to a first aspect of the present invention, there is provided a method of compressing a frame of pixel tiles comprising receiving a frame of pixel tiles, determining a bandwidth available for each pixel tile, and for each colour channel of each pixel tile performing a transform of the pixel data to create a series of coefficients, selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and performing a quantization of the series of coefficients using the selected quantization level.
According to a second aspect of the present invention, there is provided a device for compressing a frame of pixel tiles comprising an encoder arranged to receive a frame of pixel tiles, determine a bandwidth available for each pixel tile, and for each colour channel of each pixel tile perform a transform of the pixel data to create a series of coefficients, select a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and perform a quantization of the series of coefficients using the selected quantization level.
Owing to the invention, it is possible to provide a method for compressing a video image into a compressed tile format whereby the quantization level is selected from a set of predetermined quantization levels based on the distribution of the transform coefficient values in order to meet an approximate target compressed image size. The quantization level for each video tile must be provided for the quantization stage of compression. The quantization level corresponds approximately to an image quality level, and straightforward execution of the compression algorithm would require that the desired quality level is an input to the compression process. However, at a given quality level the compressed image size may vary wildly depending on the input image.
By selecting the quantization level from a set of predetermined quantization levels using a function of the determined available bandwidth and the size of the coefficients generated during the transform step, there is no need to perform any generation of a new quantization matrix, which would waste time and processing resources, as in the prior art US Patent referred to above. The invention also differs from the disclosure of this Patent in that the quantization levels or matrix used in the quantization step does not need to be transmitted to the decompression end of the process, thereby saving bandwidth, as the set of predetermined quantization levels can be present at the decompression end of the process. Preferably, the step of selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of coefficients comprises mapping the coefficient size to one of a plurality of predetermined stages and using a quantization level pre-assigned to the mapped stage. Once the coefficients have been obtained from the transform, their size is determined, which may be an absolute measure or may be based on the number of significant bits that are present, for example. This size is then used to select a quantization level. Predetermined stages (eight for example) that are specific to the determined bandwidth can be used to map the coefficient size to a specific stage which has a quantization level pre-assigned to the specific stage. In this way the quantization level is chosen. Effectively a matrix of predetermined quantization levels is used, which could be 8x8, with eight different bandwidths on one axis and eight different coefficient size ranges on the other axis.
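A minimal sketch of such a lookup follows. The table contents, the stage boundaries and the significant-bits mapping are invented placeholders; only the shape of the scheme (an 8x8 matrix indexed by bandwidth stage and coefficient-size stage) comes from the text.

```python
import numpy as np

# quant_levels[bandwidth_stage][size_stage] -> quantization level.
# Placeholder contents: the level grows as bandwidth tightens or sizes grow.
quant_levels = np.clip(np.add.outer(np.arange(8), np.arange(8)), 0, 10)

def size_stage(coeffs) -> int:
    """Hypothetical mapping of coefficient magnitude to a stage (0..7)."""
    peak = max(abs(int(c)) for c in coeffs)
    return min(peak.bit_length() // 2, 7)

def select_level(bandwidth_stage: int, coeffs) -> int:
    return int(quant_levels[bandwidth_stage][size_stage(coeffs)])

print(select_level(3, [190, -12, 7, 0, 2]))  # -> one pre-assigned level
```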
In a server-client system with a server providing compressed images to many remote clients, the aggregate of compressed image size is likely to be the correct determining factor in deciding the quality level to use, due to finite network bandwidth. This leads to a chicken-and-egg problem in deciding the quality/quantization level to use. This becomes a problem when users display images which compress less well than expected. Typically this is caused by very noisy images, such as a screen full of tiny writing or rapidly changing or very busy video imagery. For example an image may start out as twenty-four bits/pixel (eight each of red, green, blue). If quantization settings are used which compress a smooth image to one bit/pixel, then for a noisy image the same settings only get to two bits/pixel. As a general rule, at more extreme quantization (= lower quality) the difference is greater than at high quality. It is advantageous to compress with a requested bit/pixel level, rather than a requested quality.
One solution to this is to rely on statistical multiplexing, in order to hope that if some users require the compression of problem images, others may be less demanding. This will work on many occasions but can sometimes break down. For instance, if many users are watching the same multicast video, or in a teaching context are all requested to perform the same actions on their terminals, then this statistical assumption breaks down. The best solution to this problem of quality against bandwidth is to determine a budget of network bandwidth (i.e. compressed image size) for each compressed tile, dependent on the total amount of compression activity required in any given phase of activity in the server.
The quantization level for a video tile is determined after the transform stage of compression and before quantization. The input to the transform for a 64x64 video tile is 4096 pixel values, each expressed as three colour components (Y, Cr, Cb), a total of 12288 values. The output of the transform is 4096 coefficient values. Coefficient values can be quantized (= reduced in size) with little effect on human perception of the image; this is why the transform stage is performed. The relevant statistics can be collected in parallel with the transform stage of compression, as each transform result is produced.
Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:-
Figure 1 is a schematic diagram showing the processing of a video frame,
Figure 2 is a schematic diagram of components of a computer,
Figures 3 and 4 are schematic diagrams of components of an encoder,
Figure 5 is a schematic diagram illustrating the transform of a tile of a video frame, and
Figure 6 is a flowchart of a method of compressing a video frame.
Figure 1 shows a frame 10 that is comprised of tiles 12. The frame is compressed through the process steps S1 colour transform, S2 tile transform, S3 quantization and S4 entropy coding. This type of compression is used for example when a server is running virtual machines or client sessions for remote client devices. For example, a single server may have twenty remote clients connected, and the server must provide twenty images, each image representing the image of the client session that must be displayed at the respective client device. Owing to the limits on current connection technologies this type of server-client system will only work if the outgoing video data is highly compressed from the original size.
To compress a frame of video data that is expressed in conventional RGB format it is desirable to perform a colour transform to the Y, Cb, Cr domain. There is then performed a tile transform using, for example, a Haar transform or a 5/3 Discrete Wavelet Transform (DWT). The Haar transform converts an n × n tile into low frequency data, a (n/2) × (n/2) tile, where each pixel is the average of four input pixels, and high frequency data, n² − (n/2)² additional values showing local pixel deltas from the low frequency values. This can be repeated multiple times on a large square, re-visiting the low frequency data at each iteration of the process.
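A one-level sketch of this Haar step, assuming the averaging form described above (integer rounding details omitted):

```python
import numpy as np

def haar_level(tile: np.ndarray):
    """One Haar level: the LL quarter is the average of each 2x2 group;
    three quarters of delta values carry the rest (the fourth pixel of
    each group is recoverable as 4*ll - a - b - c)."""
    a, b = tile[0::2, 0::2], tile[0::2, 1::2]
    c, d = tile[1::2, 0::2], tile[1::2, 1::2]
    ll = (a + b + c + d) / 4.0            # low frequency data
    return ll, (a - ll, b - ll, c - ll)   # high frequency deltas

tile = np.random.randint(0, 256, (64, 64)).astype(float)
ll, deltas = haar_level(tile)
ll2, _ = haar_level(ll)  # re-visit the low-frequency quarter, as in the text
```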
The DWT transform performs exactly the same process but it uses low frequency data, a 5-tap-sinc-like-FIR-filtered sub-sampled (n/2) × (n/2) tile, taps (-1 2 6 2 -1), and high frequency data, n² − (n/2)² additional values showing local pixel deltas from the average of two or four adjacent pixels. Generally, for photographic data, the DWT will typically produce far more tiny numbers than the Haar transform. For synthetic images, the Haar transform will usually provide a better result, because a sharp edge between two fixed values will create several non-zero coefficients with DWT.
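The taps (-1 2 6 2 -1) are, up to the /8 normalisation, the low-pass analysis filter of the LeGall 5/3 wavelet, which is usually implemented as two integer lifting steps. A one-dimensional sketch follows, with symmetric extension at the tile edges as one plausible boundary choice:

```python
def dwt53_1d(x):
    """One 5/3 analysis pass: 'predict' makes the high-pass (odd) samples,
    'update' makes the low-pass (even) samples. Even-length input assumed,
    with symmetric extension at both ends. These lifting steps are
    algebraically equivalent to low-pass taps (-1 2 6 2 -1)/8."""
    n = len(x)
    ext = [x[1]] + list(x) + [x[-2]]                  # mirror the edges
    # predict: odd sample minus the mean of its even neighbours
    d = [ext[i + 1] - (ext[i] + ext[i + 2]) // 2 for i in range(1, n, 2)]
    # update: even sample plus a quarter of the neighbouring high-pass values
    s = []
    for j in range(0, n, 2):
        dl = d[j // 2 - 1] if j > 0 else d[0]
        dr = d[j // 2] if j // 2 < len(d) else d[-1]
        s.append(ext[j + 1] + (dl + dr + 2) // 4)
    return s, d  # low-frequency and high-frequency halves

lows, highs = dwt53_1d(list(range(16)))  # apply along rows, then columns
```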
The compression uses a tile size of 64 × 64 pixels. The transform turns a tile into a series of coefficients and these are then quantized and entropy coded. The compression uses a scheme called Run Length Golomb Rice (RLGR). This has no tables but uses two adaptive parameters to scale the coded values depending on the distribution of input values. In relation to the tile transform, the DWT transform does not break down into neat 2 × 2 groups like the Haar transform but is performed in strips (horizontally and vertically) on a large tile. This is explained in more detail below with reference to Figure 5.
Figure 2 illustrates schematically some of the components of a server 14, which has a central processing unit (CPU) 16, a graphics processing unit (GPU) 18 and a PCI encoder 20. Obviously other components of the server 14 would also be present, such as several different memory devices and other interfaces, but these have been omitted for clarity purposes. The GPU 18 is for controlling the output of a local display device connected to the computer and can be used in the compression of images if required. The PCI encoder 20 is the principal component for compressing the multiple video streams received from the CPU 16.
The hardware encoder 20 takes un-coded video tiles as its input and produces coded data messages as its output. It is important for the encoder 20 to perform all of the steps (colour transform, tile transform, quantization and entropy coding) because otherwise the IO bandwidth is increased. The encoder 20 does not need to hold an entire screen image at one time. Because of this, the encoder 20 can be built on a field-programmable gate array (FPGA) which does not need any external storage (DDR). None of the processing steps need vast storage, for example a 64 × 64 tile requires 12KB in RGB form. Holding more tiles would increase the system throughput, but it need not be huge. In hardware, an RLGR entropy encoder or decoder will be very small, because it has no tables.
Pixels to be encoded are delivered optimally over PCIe from the GPU 18 to the encoder 20, by the encoder 20 doing a PCI read or the GPU 18 doing a PCI write. Each PCIe lane delivers 4Gb per second. If pixels are packed as 32 bits then this makes 125Mps (Mega pixels/second). For an FPGA-based system, 125MHz is achievable and 250MHz is possible. This means an n-lane PCIe interface will feed pixels at n pixels/clock (125MHz) or n/2 pixels/clock (250MHz) into the encoder. Desirable values of n are 4, 8 and 16. There is value in keeping up with the bulk arrival rate of pixels in the encoder 20, otherwise there has to be added a storage buffer to the start of the encoder 20, which adds little value and increases encode latency.
A full high-definition video at 30Hz is 60Mpixels/second, so an n-lane PCIe interface can drive 2×n screens at this update rate. In practical commercial systems it would be reasonable to support more screens than this, but this is a higher level product specification choice. At this level, it is more relevant to think of the number of worst-case sessions that can be supported. Also, there may be other bottlenecks in the system, such as the GPU, the CPU or any LAN interface that is present.
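These figures can be checked with simple arithmetic (assuming the 4Gb/s effective lane payload and 32-bit packed pixels quoted above):

```python
# Back-of-envelope checks of the rates quoted in the text.
lane_bits_per_s = 4e9                    # effective PCIe payload per lane
pixels_per_lane = lane_bits_per_s / 32   # 32-bit packed pixels -> 125e6 px/s

screen_px_per_s = 1920 * 1080 * 30       # one full-HD screen at 30Hz
for n in (4, 8, 16):                     # the desirable lane counts
    print(f"{n} lanes -> {n * pixels_per_lane / screen_px_per_s:.1f} screens")
# 16 lanes -> ~32 screens, matching the 2*n figure above

print(64 * 64 * 3)   # 12288 bytes: one 64x64 RGB tile, the ~12KB quoted
```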
The encode stages are as follows. Firstly, there is a colour transform, which can be done in parallel on as many pixels as required. Inputs are 8 bits, output to be determined but perhaps 10 to 11 bits. It is possible that the Y colour channel will require more bits than Cr or Cb. If necessary, it is possible to add clip logic to control this; the whole protocol is slightly lossy, so this is acceptable if done carefully. From here on the three colour channels are split apart and dealt with in parallel.
The next stage is the tile transform. The entire transform consists of what amounts to straightforward 3-tap and 5-tap FIR filters so it is possible to achieve whatever level of parallel operation is required. The coefficients are fixed and are simple integer values, no multiplies required.
The encoder 20 includes a stage 1 vertical filter, which must keep up with the arriving pixel rate; a 5 × 64 pixel delay line is required. Assuming 30 bits/pixel this makes 9600 bits regardless of calculations/cycle. This is done as flip-flops because of the bandwidth required. Similarly, a stage 1 horizontal filter is needed, which must keep up with the arriving pixel rate, covering five adjacent pixels. At this point ¾ of the pixels go straight to the entropy coder; ¼ (32x32) require stage 2 wavelet transformation. These naturally arrive at half the overall pixel arrival rate, so it is possible to halve the width of any processing.
The encoder 20 also includes a stage 2 vertical filter, with a 5x32 pixel delay line required, using 4800 flip-flops, and a stage 2 horizontal filter covering 5 adjacent pixels. At this point another ¾ of the pixels go to the entropy coder; ¼ (16x16) require stage 3 wavelet transformation. These arrive at a quarter of the overall pixel arrival rate. The stage 3 vertical filter, with a 5x16 pixel delay line required, has another 2400 flip-flops. There is also a stage 3 horizontal filter covering 5 adjacent pixels. Of these, another ¼ (8x8) go through a final delta-coding stage. These are effectively the DC values of 8 × 8 sub-portions of the whole tile.
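The delay-line sizes quoted for the three stages follow directly from a 5-tap vertical filter over tiles of halving width:

```python
BITS_PER_PIXEL = 30  # three colour channels of roughly 10 bits each
for stage, width in ((1, 64), (2, 32), (3, 16)):
    # a 5-tap vertical filter needs 5 full lines of the current tile width
    print(f"stage {stage}: {5 * width * BITS_PER_PIXEL} flip-flop bits")
# prints 9600, 4800 and 2400, matching the figures in the text
```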
Quantization is done as a simple shift, followed by rounding. This should be done as soon as possible after the final filtering stage as it allows data path widths to be reduced. The values must be fed to the entropy coder in the correct order. About ¾ of the tile can feed directly from the tile filter into the entropy coder, but ¼ of the tile (the 1024 pixels that require stage 2/stage 3 processing) have to be stored while the rest are entropy coded. This store is SRAM, though it is not large (4KB). Its bandwidth must be at least half the pixel arrival rate.
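A sketch of shift-and-round quantization, assuming round-to-nearest and a shift of at least one:

```python
# "A simple shift, followed by rounding": adding half of the quantization
# step before shifting rounds to nearest instead of always rounding down.
# Assumes shift >= 1; negative values are rounded symmetrically.
def quantize(value: int, shift: int) -> int:
    half = 1 << (shift - 1)
    if value >= 0:
        return (value + half) >> shift
    return -((-value + half) >> shift)

def dequantize(q: int, shift: int) -> int:
    return q << shift  # reverse quantization

assert quantize(37, 3) == 5 and dequantize(quantize(37, 3), 3) == 40
assert quantize(-37, 3) == -5
```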
Entropy coding is performed separately on each of the Y, Cr, Cb channels. The entropy coder is adaptive. The number of bits used to encode a value depends on two parameters kP and kRP, and these adapt as each value is encoded, depending on the size of each passing value and the current values of the parameters. This means that it is not possible to do entropy coding in parallel on values in a stream, although fortunately each channel of the tile is a separate stream.
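The following sketch shows the flavour of such an adaptive coder. It is plain adaptive Golomb-Rice, not the full RLGR algorithm: the real coder keeps its adaptation state in the fractional parameters kP and kRP, has a run mode for zeros, and escapes long unary prefixes, all of which are simplified away here.

```python
def zigzag(v: int) -> int:
    """Interleave signs: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ..."""
    return 2 * v if v >= 0 else -2 * v - 1

def rice_bits(u: int, k: int) -> str:
    """Golomb-Rice codeword: unary quotient, then k remainder bits."""
    q, r = u >> k, u & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, "b").zfill(k) if k else "")

def adaptive_rice_encode(values, k=1):
    out = []
    for v in values:
        u = zigzag(v)
        out.append(rice_bits(u, k))
        if u >> k:               # value too big for current k: adapt upward
            k += 1
        elif u == 0 and k > 0:   # a zero: adapt downward
            k -= 1
    return "".join(out)

print(adaptive_rice_encode([0, 0, 3, -1, 4, 0]))
```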
The speed of the encoder 20 is very important. It is simple to add, subtract and compare values, but if only one per cycle can be handled then the encoder 20 cannot output more than one pixel per cycle for a tile. Ideally the encoder 20 will perform 2 or 4 per cycle. On an FPGA, 4 per cycle at 125MHz is easier than 2 per cycle at 250MHz.
Ideally, the entropy coding can be performed at the input pixel rate. For at least part of the tile processing, the values can arrive this fast. If they are processed slower than this then there is a need for more store internally; the encoder 20 can make the CPU 16 wait longer for each tile; and there may need to be more replications of the entropy coding circuit in order to keep the whole system busy, i.e. to allow the host to feed data into the encoder 20 at the full interface rate. The result of entropy coding should be built up in three separate memories, as their final size is unknown. There will need to be multiple copies of these in order to keep the whole system busy.
Once the process is complete a separate command from the CPU 16 will cause the encoder 20 to write the output packet to a chosen address over PCIe. In typical use, the output will be much smaller than the input; compression by ×5 to ×10 ought to be achievable. The CPU 16 gives one command to encode a tile. When the encoding is complete it is told the size of the result. It then gives a separate command to write the output over PCI, as where to write it may be affected by its size, for example packing into transport frames may be needed. There will need to be several copies of the output buffer in order to allow the encoder 20 to stay busy while this happens.
An alternative approach is to process multiple tiles in parallel. This removes some tricky cases for the register transfer language (RTL). However, as a general rule this will increase the size of the encoder 20 because there is more storage required: an initial memory pool to absorb PCIe input; additional copies of the filter delay lines; more output buffers. The CPU 16 gets a very low latency service. Encoding is complete shortly after the tile transfer has completed (perhaps little more than ¼ of a tile transfer time). The CPU 16 only needs 2 to 3 tiles in flight in order to keep the encoder 20 fully busy.
If the output of the compression of the tile 12 is still very big, because this tile 12 has a lot of detail, then the CPU 16 has the option to re-compress with stronger quantization. An improvement to the encoder 20 is to perform the transform, sum the unquantized coefficients, and choose the quantization level in an intelligent manner. The cost of this is that the entire tile (about 16KB) must be stored after the tile transform; and latency seen by the CPU 16 is increased.
The encoder 20 can use an input interleave. On input it is possible to load several (say four) horizontally adjacent tiles 12 at a time. The advantage of this is that PCIe transfers are 1024 bytes (256 pixels) rather than 256 bytes (64 pixels, one line of a single tile), so that use of the PCIe bus is more efficient. A store performs "de-interleave", which is analogous to the interleaving of Reed-Solomon (RS) codewords in a modem. To de-interleave four tiles only needs a store that can contain exactly four tiles (for example 4×12KB) with a total bandwidth of 2× the PCIe arrival rate. The store can be organised with its own internal interleaving to get the required bandwidth with single-port RAMs. This is easiest if input and output are synchronous. This store can be organised so that any number of tiles up to the maximum can be loaded. This is carried out before the colour transform because the colour transform causes some bit growth.
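As a minimal sketch of the de-interleave store's addressing (assuming one scan line per burst and the 64-pixel, 4-bytes/pixel line size implied above; the function name is invented), each burst is simply sliced into per-tile segments:

```python
def deinterleave(bursts, n_tiles=4, line_bytes=256):
    # Each incoming burst holds one scan line spanning n_tiles
    # horizontally adjacent tiles; split it into per-tile buffers.
    tiles = [bytearray() for _ in range(n_tiles)]
    for burst in bursts:                      # one 1024-byte burst
        for t in range(n_tiles):
            tiles[t].extend(burst[t * line_bytes:(t + 1) * line_bytes])
    return tiles
```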
Figures 3 and 4 show more detail of the encoder 20. A PCI interface 22 connects to the CPU 16 over a PCI bus. The interface logic 22 connects to a set of control registers 24. Downstream of the interface 22 is a colour transform unit 26 and the tile transform logic 28. The tile transform unit 28 outputs to a tile store 30. The tile store 30 connects to a quantization stage 32 and downstream of the quantization stage 32 is an entropy encoder 34, which connects to an output store 36.
In relation to the Discrete Wavelet Transform, the diagram is scaled by the factor n, the number of PCIe lanes. Desirable values for n are 4, 8 and 16. This part of the device processes the input data at "line rate", i.e. however fast it arrives over PCIe. A tile is completed and stored in the tile store 30 in little more than the PCIe transfer time. In relation to the entropy coder 34, shown in Figure 4, the second part of the encoding process goes more slowly because in large configurations the entropy coding cannot keep up with the rate of data arrival. This is scaled by the factor n2, the number of steps/cycle of RLGR entropy coding that can be achieved. Likely values for n2 are 1, 2, 4 or 8, depending on clock speed and technology.
In relation to the tile store 30, the number of "tiles in flight" needed to keep the engine fully busy is somewhere between n/n2 and 2xn/n2, depending on how often the CPU 16 checks whether a tile has been completed.
It would be possible to perform the quantization before the tile store 30. The value of performing it afterwards is that it allows the hardware to determine a quantization level, based on some statistical measure of the values that enter the tile store. If there are many large values (meaning that the current tile is hard to compress) then it may be desirable to increase the quantization level, so that the result is usefully compact. Quantization of the DC values (the top-left 8x8 values, the LL3 sub-band) is done before delta coding.
The entropy coder output must be stored again in an output store 36, as shown in Figure 4, because at this point its size is not known, so for all but the first channel the encoder 20 does not know where to put it. The 12KB value suggested above means "tile not compressed at all" and is assumed to be a worst case. Some upper bound must be chosen, above which the compression has failed (i.e. the tile must be recompressed with greater quantization). The tile store 30 and the output store 36 could be the same memory system or could be separate, depending on bandwidth requirements. The best structure (and interleave organisation etc.) will vary depending on n and n2.
The complete tile is copied over PCIe to the intended destination address in host memory. To make best use of the interface 22 this should drive all n PCIe lanes at full rate. When this happens there is no input traffic: tile input (over PCIe) and tile output (over PCIe) are never active simultaneously. This is an argument for having the tile store 30 and the output store 36 as the same memory.
The three two-dimensional DWT filters have identical logic within certain parameters. The bits/value might not be the same for each channel and/or level (e.g. Y might need more than Cr, Cb). The values/cycle are scaled to fit the required throughput within a tile, and the vertical spacing is 64/32/16 depending on level. No multipliers are needed because the coefficients are all simple integers. Figure 5 illustrates the mechanics of the tile transform process carried out by the tile transform logic 28. The individual colour components of the tile 12 are each processed three times with a DWT. Each pass of the DWT is carried out firstly vertically and then horizontally through the tile.
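As a sketch of one such pass, a Haar kernel is assumed here, being the simplest of the transforms named in the claims; the actual filter taps used by the tile transform logic 28 are not reproduced.

```python
def haar_1d(x):
    # One level of a Haar-style integer transform: sums and
    # differences only, so no multipliers are needed.
    half = len(x) // 2
    lo = [(x[2 * i] + x[2 * i + 1]) >> 1 for i in range(half)]  # averages
    hi = [x[2 * i] - x[2 * i + 1] for i in range(half)]         # details
    return lo + hi

def dwt_pass(tile):
    # One 2-D pass over a square tile: vertically, then horizontally.
    n = len(tile)
    cols = [haar_1d([tile[r][c] for r in range(n)]) for c in range(n)]
    rows = [[cols[c][r] for c in range(n)] for r in range(n)]
    return [haar_1d(row) for row in rows]
```

Applying the pass three times, each time only to the low-frequency (top-left) quadrant of the previous result, yields the ten sub-bands referred to later.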
Logically the CPU 16 must perform two operations for each tile. Firstly, the CPU 16 must start a compress operation. The CPU 16 must tell the GPU 18 to write to encoder 20 (or tell the encoder 20 to read from the GPU 18). There are a few parameters such as quantization level and RLGR1/3 selection. An output tile ID (= index in tile store and output store) should also be selected. The second operation is that once compression is complete, data in the output store must be identifiable by the tile ID quoted in the first operation. The CPU 16 has to check pass/fail, read output size, decide where the data is required and tell the encoder 20 to write to required destination. The intended output address could be specified at the start and this reduces the CPU involvement in each tile, but only works if the address of each does not rely on the compressed size of the previous tile.
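The two-operation flow can be summarised in pseudo-driver form. The register names, encodings, status bits and device handle below are entirely hypothetical; the text specifies only the two operations and their parameters.

```python
REG_START, REG_STATUS, REG_SIZE, REG_WRITE = 0x00, 0x04, 0x08, 0x0C
DONE, FAILED = 0x1, 0x2          # hypothetical status bits

def compress_tile(dev, tile_id, quant_level, rlgr_mode, dest_addr):
    # Operation 1: start the compress, quoting an output tile ID.
    dev.write(REG_START, (tile_id, quant_level, rlgr_mode))
    while not dev.read(REG_STATUS, tile_id) & DONE:
        pass                      # in practice: interrupt or polling loop
    if dev.read(REG_STATUS, tile_id) & FAILED:
        return None               # recompress with stronger quantization
    # Operation 2: read the size, decide where it goes, write it out.
    size = dev.read(REG_SIZE, tile_id)
    dev.write(REG_WRITE, (tile_id, dest_addr))
    return size
```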
A preferred embodiment is for the CPU 16 to provide a scatter list of output (location, size) entries. Each entry is tagged with whether a tile can be split over the end of the entry. This allows a set of network buffers to be described, where each buffer might be split up due to logical/physical translation. The encoder 20 has to indicate somehow which entries have been used. Within a buffer, enough space has to be allowed for a TS_RFX_TILE block header. This is 19 bytes including (x, y) coordinates, length fields and quantization table fields, followed by the bit-packed data.
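A scatter entry might be modelled as follows. The field names and the greedy placement rule are assumptions; only the (location, size) pair, the split tag and the 19-byte header figure come from the description.

```python
from dataclasses import dataclass

@dataclass
class ScatterEntry:
    address: int        # host memory location
    size: int           # bytes available at that location
    allow_split: bool   # may a tile straddle the end of this entry?
    used: int = 0       # bytes consumed, reported back to the CPU 16

TILE_HEADER_BYTES = 19  # TS_RFX_TILE block header

def place_tile(entries, payload_bytes):
    # Greedy placement sketch: find an entry with room for the header
    # plus payload, or one that allows the tile to be split across
    # entry boundaries (the remainder continuing in the next entry).
    need = TILE_HEADER_BYTES + payload_bytes
    for entry in entries:
        free = entry.size - entry.used
        if free >= need or (entry.allow_split and free >= TILE_HEADER_BYTES):
            taken = min(free, need)
            entry.used += taken
            return entry.address, taken
    return None
```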
When under high system load, higher compression factors have to be used. Assuming a working average compression factor of six, this leads to a four bits/pixel output. One full high-definition screen at 30Hz = 60M pixels/second = 240Mbps. A Gigabit Ethernet port is therefore saturated by only a few such screens, so when all users are watching movies the encoder 20 has to compress by far more than a factor of six. Two bits/pixel at 5 frames per second is probably near the low end before users start to notice degradation in quality. Reducing the refresh rate saves work all round, and reducing the quality increases the asymmetry between encoder input and encoder output. Therefore content-based quantization selection within the encoder 20 is a very useful way to provide compression flexibility: the encoder 20 will compress by the "right" amount far more often than when the CPU 16 has to guess the quantization level, and re-encoding a tile is an expensive operation.
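The arithmetic behind these figures, reproduced as a quick check (a 1920x1080 screen geometry is assumed; the text rounds the pixel rate to 60M/second):

```python
PIXELS_PER_SECOND = 1920 * 1080 * 30     # one full-HD screen at 30Hz, ~62M

for bits_per_pixel in (4, 2):
    mbps = PIXELS_PER_SECOND * bits_per_pixel / 1e6
    screens = int(1000 // mbps)          # per Gigabit Ethernet port
    print(f"{bits_per_pixel} bits/pixel: {mbps:.0f} Mbps, "
          f"{screens} screens per GigE port")
```

At four bits/pixel this gives roughly 249 Mbps, i.e. only about four such screens per gigabit port, consistent with the saturation point made above.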
High performance PCIe throughput is essential for a high performance encoder product. Each PCIe lane can support uncompressed data for two full high-definition 30Hz updating screens, as a theoretical maximum. So, a 16*PCIe GPU and a 16*PCIe Encoder, with everything else perfect, cannot achieve more than 32 such screens. The current structure of GPU interfacing in Windows 7 does not allow movement of Windows 7 screen data direct between the GPU 18 and the encoder 20. At the very least the data would move twice over the PCIe bus: once moved by a DirectXIO primitive which copies a texture from GPU 18 to system memory, and then again as the encoder 20 issues PCIe reads to that texture.
If the GPU 18 has 16×PCIe and the CPU 16 does a good job of this copy, the theoretical maximum bandwidth is still reduced; in particular, a high performance encoder with 16×PCIe lanes would likely be halved in its throughput. The known path is therefore for the encoder 20 to perform PCIe READs from texture memory in store. It would be desirable for the encoder 20 to accept PCIe WRITE operations containing the pixel data.
The encoder 20 provides content-sensitive quantization. The value of doing quantization after the tile store is that it allows the hardware to determine a quantization level, based on a statistical measure of the coefficient values that enter the tile store, from a set of predetermined quantization levels. If there are many large values (which implies that this tile is hard to compress) then it is desirable to increase the quantization level, so that the result is usefully compact.
For the CPU 16, the high level decision is to compress to available bandwidth. Based on a gross measure of current activity, each tile is given a compressed size target at the start of the compression operation. The coefficients exist in ten sub-bands, ranging from low frequency (most important) to high frequency (can be quantized most) data. The encode process can quantize each sub-band separately, so that each tile has ten quantization values sent with it.
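Sketching the per-sub-band quantization step: shift-based quantization is an assumption here, consistent with the "quantization shift value" mentioned below, and the sub-band names follow the usual three-level DWT labelling.

```python
# Ten sub-bands from a three-level DWT, low to high frequency.
SUB_BANDS = ["LL3", "HL3", "LH3", "HH3",
             "HL2", "LH2", "HH2",
             "HL1", "LH1", "HH1"]

def quantize_subband(coeffs, shift):
    # Quantize by arithmetic right-shift towards zero.
    return [c >> shift if c >= 0 else -((-c) >> shift) for c in coeffs]

def quantize_tile(bands, shifts):
    # bands and shifts are dicts keyed by sub-band name; the ten
    # shift values travel with the encoded tile.
    return {name: quantize_subband(bands[name], shifts[name])
            for name in SUB_BANDS}
```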
As the transform coefficients are stored, the encoder 20 collects statistics about them. The most desirable statistic to collect is the number of significant bits for a range of quality settings, all in parallel. When the tile is ready for entropy coding, the encoder 20 picks the quality setting where the number of significant bits does not exceed the desired encoded tile size. The entropy coder will not result in precisely this many bits but on average it will be close enough to be useful.
A simple area of the screen will not require much space even when coded at maximum possible quality. So, there is little waste, and complexity in one area of the screen will not compromise quality in another. There are other strategies which the CPU 16 can use to balance quality in different areas, or to mend tiles which were sent at low quality and where there is now spare bandwidth available.
There is a further adaptation that can be used when the computer 14 is under heavy load. The GPU 18 subsamples the entire screen in its store. The subsampled screen is then passed to the encoder 20 in 32x32 tiles, reducing the initial transfer time by ¾. The encoder 20 operates to encode these as if the high-frequency coefficients are all zeros. Alternatively, the transfer happens as a 64x64 block, but the encoder 20 then produces 4 encoded tiles (a 2x2 square). This reduces PCIe load a great deal for cases where the encoder 20 is going to quantize the high frequency coefficients out of existence anyway.
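A 2:1 subsample in each dimension might look as follows; a 2x2 box average is assumed, as the text does not specify the GPU's subsampling kernel.

```python
def subsample_half(screen):
    # Average each 2x2 block, halving both dimensions; a 64x64 block
    # of source pixels becomes a 32x32 tile.
    return [[(screen[y][x] + screen[y][x + 1] +
              screen[y + 1][x] + screen[y + 1][x + 1]) >> 2
             for x in range(0, len(screen[0]), 2)]
            for y in range(0, len(screen), 2)]
```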
The fundamental part of the compression process is shown in Figure 6. The method of compressing the frame of pixel tiles comprises, firstly, step S6.1, which comprises receiving the frame of pixel tiles, and secondly, step S6.2, which comprises determining a bandwidth available for each pixel tile; steps S6.3, S6.4 and S6.5 are then repeated for each colour channel of each pixel tile. This method is executed by the encoder 20, which receives the frame of pixels from either the CPU 16 or the GPU 18, with the information about the bandwidth being supplied by the CPU 16. Preferably, the available bandwidth for each pixel tile is expressed as a bit rate per pixel.
Step S6.3 comprises performing a transform of the pixel data to create a series of coefficients, step S6.4 comprises selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and step S6.5 comprises performing a quantization of the series of coefficients using the selected quantization level. In this way, the tiles of the frame are compressed using a quantization level that is selected intelligently, in real-time, as the compression is carried out. Essentially a desired size for the post-quantization data is known and the quantization level is selected to achieve that desired size, taking into account the coefficients that have resulted from the transform step S6.3.
Step S6.4 uses estimates of the final entropy-coded size mapped to a range of different quantization settings. The estimates need not be exact in order to be useful, as meeting an approximate target for the encoded size of the tile is sufficient to meet design objectives concerning bandwidth management. The tile is quantized using a range of quantization values, with different values for each channel and each sub-band. A number of estimation methods are possible, ranging in complexity.
In one simple embodiment, the quantization settings for the tile are approximated using a single "quality" metric which is expressed as an expected bit/pixel value. These are determined by exhaustive search over a chosen set of reference images. A small finite set of quality metric settings is chosen, giving "minimum compression", "maximum compression", and a range of values between. Eight values would be sufficient. Each quality setting provides a quantization level for each sub-band.
For each quality metric setting, a statistic is gathered over the coefficients. In a hardware implementation these can be gathered in parallel. The statistic is the sum, over all the coefficients, of (the number of significant bits in the coefficient minus the quantization shift value for this quality metric and this sub-band), taking 1 wherever this difference is <= 0. Having gathered this statistic, the encoder selects the quality level where the statistic most closely meets the desired target tile size in bits. The statistic can be improved by giving special consideration to zero coefficients, since the entropy coder in use can code runs of zeros in less than one bit. When a run of zeros appears within a sub-band, after a fixed number of zeros their size is taken to be 0 rather than 1. Making this change after six consecutive zeros gives reasonable results, but the optimum will depend on the precise entropy coding system in use.
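Putting the statistic and the selection together, a minimal sketch follows, assuming the dict-based sub-band layout of the earlier quantization sketch; quality_tables maps each quality setting to its per-sub-band shift values.

```python
def estimate_bits(bands, shifts, zero_run=6):
    # Sum of (significant bits - shift), floored at 1, with zeros
    # counted as free after zero_run consecutive zeros in a sub-band.
    total = 0
    for name, coeffs in bands.items():
        run = 0
        for c in coeffs:
            run = run + 1 if c == 0 else 0
            if c == 0 and run > zero_run:
                continue                  # long zero runs cost ~0 bits
            total += max(1, abs(c).bit_length() - shifts[name])
    return total

def pick_quality(bands, quality_tables, target_bits):
    # Choose the setting whose estimate most closely meets the target;
    # in hardware, all the estimates are gathered in parallel as the
    # coefficients are stored.
    return min(quality_tables,
               key=lambda q: abs(estimate_bits(bands, q) - target_bits))
```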

Claims

1. A method of compressing a frame of pixel tiles comprising:
receiving a frame of pixel tiles,
determining a bandwidth available for each pixel tile, and for each colour channel of each pixel tile:
performing a transform of the pixel data to create a series of coefficients,
selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and
performing a quantization of the series of coefficients using the selected quantization level.
2. A method according to claim 1, wherein the available bandwidth for each pixel tile is expressed as a bit rate per pixel.
3. A method according to claim 1 or 2, and further comprising varying the bandwidth available for different tiles of the same frame of pixel tiles.
4. A method according to claim 1, 2 or 3, wherein the transform comprises a Haar transform or a Discrete Wavelet Transform.
5. A method according to any preceding claim, and further comprising performing entropy coding of the quantized series of coefficients.
6. A method according to any preceding claim, wherein the step of selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of coefficients comprises mapping the coefficient size to one of a plurality of predetermined levels and using a quantization level pre-assigned to the mapped level.
7. A device for compressing a frame of pixel tiles comprising an encoder arranged to:
receive a frame of pixel tiles,
determine a bandwidth available for each pixel tile, and
for each colour channel of each pixel tile:
perform a transform of the pixel data to create a series of coefficients,
select a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of the coefficients, and
perform a quantization of the series of coefficients using the selected quantization level.
8. A device according to claim 7, wherein the available bandwidth for each pixel tile is expressed as a bit rate per pixel.
9. A device according to claim 7 or 8, wherein the encoder is further arranged to vary the bandwidth available for different tiles of the same frame of pixel tiles.
10. A device according to claim 7, 8 or 9, wherein the transform comprises a Haar transform or a Discrete Wavelet Transform.
11. A device according to any one of claims 7 to 10, wherein the encoder is further arranged to perform entropy coding of the quantized series of coefficients.
12. A device according to any one of claims 7 to 11, wherein the encoder is arranged, when selecting a quantization level from a set of predetermined quantization levels according to a function of the determined bandwidth and the size of coefficients, to map the coefficient size to one of a plurality of predetermined stages and to use a quantization level pre-assigned to the mapped stage.