GB2360916A

GB2360916A - Compression encoder which transmits difference between new data word and recent data word where this falls within a threshold

Info

Publication number: GB2360916A
Application number: GB0018263A
Authority: GB
Inventors: Jason Charles Pelly; Jonathan James Stone
Original assignee: Sony United Kingdom Ltd
Current assignee: Sony Europe Ltd
Priority date: 2000-03-30
Filing date: 2000-07-25
Publication date: 2001-10-03
Anticipated expiration: 2020-07-25
Also published as: GB0007784D0; GB2360916A9; GB0018263D0; GB2360916B

Abstract

A compression encoder receives data words and stores at least part of the most recently received ones in a memory. Newly received data words are compared with the at least parts of data words in the memory; where the similarity between the data words falls within a predetermined threshold, they are deemed to match. If such a match condition occurs, the compression encoder outputs data indicating the stored data word that the new data word matches and data defining the difference between them. The threshold parameter which is used in the comparison of the data words is varied in dependence upon the data received. The threshold may be a number of identical bits or a numerical difference between the data words.

Description

P/8338.GBP

DATA COMPRESSION

This invention relates to data compression.

Data compression is widely used in the storage and transmission of data such as general computer data and images.

Data compression algorithms operate on an input data stream to generate a compressed data stream, which is then decompressed to form an output data stream. They can generally be subdivided into two categories, lossy algorithms and lossless algorithms.

Lossy algorithms, such as the MPEG video compression algorithm, are commonly applied to image data and do not guarantee that the output data stream will be an exact replica of the input data stream. On the other hand, so-called lossless algorithms do recreate the input data stream exactly at decompression.

Lossless algorithms commonly used for compressing computer data include the so-called "ZIP" and "ALDC" algorithms based on the work of Lempel and Ziv originally described in the paper, "A Universal Algorithm for Sequential Data Compression", IEEE Trans Inform Theory, vol. IT-23, No 3, pp337-343, 1977. ALDC is further described in the ANSI standard for Information Technology, "Adaptive Lossless Data Compression (ALDC)", X3B5/95-318A, December 1995.

In Lempel and Ziv's algorithms, multi-bit input data words are stored in a history buffer which contains (typically) the 512 most recent such words. As each data word is received, it is compared to the words already stored in the history buffer. If no matches are found, which is to say that the newly received data word is not the same as any data words stored in the history buffer, or if the match with the history buffer is only one word long, then the new data word is added to the compressed data stream as a "literal", that is, without any encoding being applied. However, if two or more successive data words match the same string of data words in the history buffer then a "copy pointer" is added to the compressed data stream. The copy pointer indicates a start position in the history buffer and the length of the matching string of data words. This algorithm therefore exploits a repetitive nature often found in computer data: if string matches are found frequently then good compression is achieved, as a copy pointer is arranged to require fewer bits in the compressed data stream than the original data. However, if string matches are not found sufficiently frequently, these algorithms can actually worsen the data rate rather than compress it, as the compressed data stream then consists of data words transmitted as literals plus a certain amount of overhead and synchronisation information.

Another class of lossless algorithm which works particularly well for image data is the so-called "Rice" algorithm based around the paper, "A VLSI Chip Set for High-Speed Lossless Data Compression", J Venbrux et al, IEEE Trans on Circuits & Systems for Video Tech, vol. 2, No 4, 1992.

The Rice algorithm requires input data to be passed through a data modeller such as a differential pulse code modulation (DPCM) modeller before encoding. The modeller is designed to weight the distribution of the data towards a predominance of zero-valued data words. Coding then consists of adding the n least significant bits (LSBs) directly to the compressed data stream and performing fundamental sequence encoding on the remaining most significant bits (MSBs). While the Rice algorithm can work very well in particular circumstances, it is dependent upon the modelling algorithm matching the characteristics and statistical properties of the input data stream to be compressed. The Rice algorithm is therefore far from being a generic algorithm; instead, it must be well matched to the nature of the input data stream in use.
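The coding step just described can be illustrated with a minimal sketch. It assumes non-negative word values, assumes that the fundamental sequence for a value m is m zeros followed by a one, and assumes the fundamental sequence is emitted before the raw LSBs; the function names and these conventions are illustrative only and are not taken from the cited paper.

```python
def rice_encode(value: int, n: int) -> str:
    """Return a bit string: fundamental sequence for the MSBs, then n raw LSBs."""
    if value < 0 or n < 0:
        raise ValueError("value and n must be non-negative")
    msbs = value >> n                    # part coded as a fundamental sequence
    lsbs = value & ((1 << n) - 1)        # part sent without encoding
    fundamental = "0" * msbs + "1"       # assumed convention: m zeros, then a one
    raw = format(lsbs, f"0{n}b") if n else ""
    return fundamental + raw


def rice_decode(bits: str, n: int) -> int:
    """Invert rice_encode for a single code word."""
    msbs = bits.index("1")               # number of leading zeros
    field = bits[msbs + 1 : msbs + 1 + n]
    lsbs = int(field, 2) if n else 0
    return (msbs << n) | lsbs
```

Small values give short codes, which is why the modeller is expected to concentrate the data around zero.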

Our copending applications numbers 0007782.6 and 0007781.8 (agent's references P/8339.GB and P/8340.GB), having the same filing date as the present application, propose data compression apparatus and algorithms potentially offering some of the advantages of ALDC and of image-data-specific algorithms such as Rice.

In application number 0007781.8 (agent's reference P/8340.GB), input data words for encoding are stored in a data memory (e.g. a history buffer), but instead of requiring an exact match for strings of words stored in the buffer (as in ALDC), only the most significant bits are matched. The n least significant bits are added to the output compressed data stream in a form which does not depend on matches with the history buffer. In application number 0007782.6 (agent's reference P/8339.GB), input data words are also stored in a data memory (e.g. a history buffer), but instead of requiring an exact match for strings of words stored in the buffer (as in ALDC), a match within a certain numerical range is also considered a match. The numerical range is such that a match can be obtained between a word in the data memory and at least two possible numerical values of an input data word.

In general, in trials of prototype embodiments it has been found that each of these techniques can provide results which are often better than those obtained with ALDC.

This invention provides data compression apparatus for compressing input data words into an output compressed data stream, the apparatus comprising:

a data memory for storing at least a part of each of a plurality of most recently received input data words;

comparing logic for comparing each input data word with data words stored in the data memory to detect whether a match condition exists, a match condition being defined by an input data word being similar to a data word stored in the data memory to within a degree of similarity defined by a compression parameter;

an encoder for encoding an output data stream portion in respect of an input data word for which a match condition exists by reference to a data word stored in the data memory and difference data defining a difference between that input data word and that data word stored in the data memory; and

an analyser for analysing properties of the input data words to determine a compression parameter to be used by the comparing logic.

The invention recognises that in data compression systems such as the ones outlined above, the best value of a compression parameter may differ depending on the category of data to be compressed. For example, in trials of prototype embodiments it has been found that a compression parameter implying (in either example system) a looser identity between an input data word and a history buffer entry is often appropriate for image data, whereas a requirement for a closer identity is often more appropriate for data representing text. However, the compression apparatus may not have access to information defining the type of data to be compressed.

The invention addresses this problem by analysing the input data itself to determine the best (or at least a useful) compression parameter to use in the compression of that data. The analysis can, for example, take into account statistical properties of the input data words from word to word, such as the numerical or logical differences between adjacent words.

In this way, a useful compression parameter can be obtained "in advance", that is to say, without having to analyse the success or otherwise of the compression process itself. In this way, it is possible to vary the compression parameter during compression of, say, a large file of data, to reflect the varying properties of the data during the compression process.

Further respective aspects and features of the invention are defined in the appended claims.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, throughout which like parts are indicated by like references, and in which:

Figure 1 is a schematic diagram of a computer network employing lossless data compression;

Figure 2 is a schematic diagram of a first embodiment of a data compression encoder, to be referred to as an "SZIP" encoder;

Figure 3 is a schematic diagram of an SZIP decoder complementary to the encoder of Figure 2;

Figure 4 is a schematic diagram of a second embodiment of a data compression encoder, to be referred to as a "DZIP" encoder;

Figure 5 is a schematic diagram of a DZIP decoder complementary to the encoder of Figure 4;

Figure 6 schematically illustrates an adaptive SZIP encoder;

Figure 7 schematically illustrates a part of the encoder of Figure 6;

Figure 8 schematically illustrates an adaptive DZIP encoder;

Figure 9 schematically illustrates a control circuit of the encoder of Figure 8; and

Figure 10 schematically illustrates an adaptive SZIP or DZIP decoder complementary to the respective encoder of Figure 6 or Figure 8.

Figure 1 is a schematic diagram of a computer network employing lossless data compression. A server computer 10 comprising a processing unit 60 and a display 65 is linked to a client computer 20 by a network connection 30. Data stored on a data storage medium 40 at the server, such as a hard disc store, a RAID store, an optical disc store or a tape store is communicated, via a network interface card 50 in the processing unit 60 of the server to the network connection 30. At the client, the communicated data is received by a network interface card 70 in a processing unit 80 of the client.

The network interface card 50 is arranged to perform lossless data compression and the network interface card 70 is arranged to perform a complementary lossless decompression process.

Because the compression and decompression are lossless, the data received at the output of the decompression process by the processing unit 80 are identical to the data retrieved by the server from the data store 40. However, the compression process means that data traffic on the network connection 30 is less than would be the case if the data were transmitted without compression.

"SZIP" Encoder

Figure 2 is a schematic diagram of a first embodiment of a data compression encoder, to be referred to for convenience as an "SZIP" encoder. This example of an encoder and subsequently described embodiments in the present application are shown operating on 8 bit input data words, but it will be clear that other types of input data word could be used.

Each input data word is passed to a demultiplexer 100 which, in response to a control input value "n", where n is an integer no less than zero and no more than the number of bits per word (8 in this example), splits the input data word into two subwords, one formed of n least significant bits (LSBs) and the other formed of (8-n) most significant bits (MSBs).

The n LSBs are stored in a buffer 110 for the duration of the processing applied to the remaining MSBs.

The (8-n) MSBs are passed to a compare processor 120 which compares them with corresponding MSBs of entries in a history buffer 130.
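As a minimal sketch of the split performed by the demultiplexer, assuming 8 bit words as in this example (the names `split_word` and `join_word` are hypothetical and used only for illustration):

```python
WORD_BITS = 8  # word size used in this example


def split_word(word: int, n: int) -> tuple[int, int]:
    """Split a word into its (8-n) MSBs and its n LSBs."""
    if not 0 <= n <= WORD_BITS:
        raise ValueError("n must lie between 0 and the word size")
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word does not fit in the word size")
    msbs = word >> n                  # sub-word sent to the compare processor
    lsbs = word & ((1 << n) - 1)      # sub-word held back in the LSB buffer
    return msbs, lsbs


def join_word(msbs: int, lsbs: int, n: int) -> int:
    """Recombine the two sub-words into the original word."""
    return (msbs << n) | lsbs
```

For instance, with n = 3 the word 10110110 splits into MSBs 10110 and LSBs 110.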

The history buffer 130 is similar to the buffer used in the ALDC algorithm and may contain, for example 512, 1024 or 2048 data words, although the skilled man will appreciate that the choice of size of the history buffer is a matter of routine. The history buffer is updated by adding the current input data word, and discarding (once the history buffer is full) the least recently stored data word, each time a current data word has been processed. In the present embodiment, the history buffer stores all eight bits of each input data word. This has advantages in allowing the value of n to be varied during the data compression process (see the description of Figure 6 below). However, if the value of n is fixed in a predetermined manner, the history buffer could be arranged to store only the (8-n) MSBs of each input data word. Having said this, data memories of the order of 512 to 2048 bytes are cheap, even when implemented (as in these examples) on an application specific integrated circuit (ASIC) or field programmable gate array (FPGA), so the present embodiment stores all eight bits to allow a greater flexibility of design and use.

The operations carried out by the compare processor 120 involve searching the history buffer for strings of data words having (8-n) MSBs matching the corresponding (8-n) MSBs of the incoming data stream. The compare processor identifies where the longest match has occurred. If the length of this match is greater than one data word long (although in general the minimum number of matches may be made greater than 2), then details defining the match in the history buffer are encoded as a compressed word. The compressed word encodes the length of the match and its position within the buffer. If no matches occur, or if the match is only one data word long, then the (8-n) MSBs of the input data word are encoded as a "literal", that is to say, raw data.

Compression is therefore achieved by encoding several data words within a single compressed word. Markers are required to inform a subsequent decoder whether a compressed word or a literal word has been encoded.

Where a compressed word is transmitted, the decoder (to be described in greater detail below) accesses a string of data words at the corresponding position in a history buffer maintained at the decoder. The decoder adds each decoded data word to its history buffer as soon as it is decoded, and so in this way maintains an identical history buffer to that used at the encoder.

So, as regards the (8-n) MSBs, the processing applied to that sub-word is in fact similar to the ALDC algorithm. Indeed, if the variable n were set to zero, the encoder of Figure 2 would operate as an ALDC encoder. Accordingly, architectures which have been proposed for the ALDC algorithm, such as an architecture defined in US-A-5 652 878, may be used or adapted for the compare processor and history buffer.

The output of the compare processor 120 is therefore either a literal, being the (8-n) MSBs in the case that a match of two or more words has not been found, or a compressed word where a match of at least two words has been found. When such an output is generated by the compare processor 120, the compare processor controls the buffer 110 to output the LSBs which have been temporarily stored in the buffer. These are output in an uncompressed or "literal" form and passed, with the output of the compare processor 120, to an output multiplexer 140 which generates an output data stream. The syntax of the output data stream will be described below.
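The encoder behaviour described so far can be sketched at the level of tokens rather than bits. This is an illustrative sketch under several assumptions that are not taken from the document: the search is a plain exhaustive scan, a match may not run past the end of the history buffer, the history is updated once per emitted token, and the position is counted from the oldest buffer entry. The function name and the token layout are hypothetical.

```python
from collections import deque


def szip_encode(words, n, history_size=512, min_len=2, max_len=271):
    """Return tokens: ("literal", word) or ("copy", position, length, lsbs)."""
    history = deque(maxlen=history_size)   # oldest entries fall out when full
    tokens = []
    i = 0
    while i < len(words):
        hist = list(history)
        best_pos, best_len = 0, 0
        for pos in range(len(hist)):
            length = 0
            # Extend the match while the (8-n) MSBs agree.
            while (pos + length < len(hist)
                   and i + length < len(words)
                   and length < max_len
                   and hist[pos + length] >> n == words[i + length] >> n):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= min_len:
            chunk = words[i:i + best_len]
            lsbs = [w & ((1 << n) - 1) for w in chunk]   # sent unencoded
            tokens.append(("copy", best_pos, best_len, lsbs))
        else:
            chunk = words[i:i + 1]
            tokens.append(("literal", chunk[0]))
        history.extend(chunk)              # full 8 bit words are stored
        i += len(chunk)
    return tokens
```

With n = 0 the LSB lists carry no information and the sketch reduces to exact string matching, in line with the remark that the encoder then operates as an ALDC encoder.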

In common with the ALDC architecture, the length of match in the history buffer is transmitted as a Huffman code of 2, 4, 6, 8 or 12 bits using the coding values set out in the following table.

Contents of Length Code Field    Match Count in Bytes
00                               2
01                               3
1000                             4
...                              ...
1011                             7
110000                           8
...                              ...
110111                           15
1110 0000                        16
...                              ...
1110 1111                        31
1111 0000 0000                   32
...                              ...
1111 1110 1110                   270
1111 1110 1111                   271

It will be seen from the above table that the maximum match count available with this code is 271. Accordingly, the compare processor is arranged to force an output if the length of match reaches 271 data words. However, it will of course be appreciated that other encoding schemes using different length or entirely different codes or coding arrangements could be employed.

If the minimum number of matches (see above) is made greater than 2, the 270 distinct codes provided in the above table can instead be mapped to a different range of match lengths or counts. So, for example, if the minimum match length were 7, the above codes could be mapped to match lengths of 7 to 276.
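A minimal sketch of the length code follows. It assumes that the values between the listed table rows follow the pattern those rows suggest, namely a prefix identifying the range followed by a fixed-width offset within the range; the function name and the `min_len` parameter are hypothetical.

```python
def encode_match_length(length: int, min_len: int = 2) -> str:
    """Return the length code bit string for a match of the given length.

    min_len remaps the 270 codes to the range min_len .. min_len + 269.
    """
    value = length - min_len + 2          # map back onto the 2..271 table
    if not 2 <= value <= 271:
        raise ValueError("match length outside the range of the code")
    if value <= 3:
        return format(value - 2, "02b")           # 2 bits:  00, 01
    if value <= 7:
        return "10" + format(value - 4, "02b")    # 4 bits
    if value <= 15:
        return "110" + format(value - 8, "03b")   # 6 bits
    if value <= 31:
        return "1110" + format(value - 16, "04b") # 8 bits
    return "1111" + format(value - 32, "08b")     # 12 bits
```

Shorter matches receive shorter codes, and with min_len set to 7 the same 270 codes cover match lengths of 7 to 276.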

The output data syntax is as follows. Each compressed data word is either a literal or a copy pointer. The first bit of the compressed data forms a flag to indicate which of these is in use.

Literal: compressed data 0<b><b><b><b><b><b><b><b>

where each <b> is one bit of the eight bit input word, taking the value 0 or 1.

Copy Pointer: compressed data 1<length code><displacement><LSBs>

where <length code> = Huffman code from above table.

<displacement> ... , i.e. a binary number defining a position within the history buffer. The number of bits is selected to be sufficient to encode all possible positions within the history buffer, e.g. 11 bits for a 2048-word history buffer.

<LSBs> ... , i.e. length n bits.

"SZIP" Decoder

Figure 3 is a schematic diagram of an SZIP decoder complementary to the encoder of Figure 2.

The decoder comprises an input demultiplexer 200, a control circuit 210, a history buffer 220, an LSB buffer 230 and an output multiplexer 240.

The input demultiplexer 200 is responsive to the parameter n and also to control flags within the input compressed data stream, to separate the unencoded LSBs from the encoded MSBs of the input data stream. These flags are derived by the control circuit 210 and serve to indicate to the input demultiplexer how many LSBs to extract from the input data stream.

The unencoded LSBs are supplied to an LSB buffer 230 where they are stored until an output is forced by the control circuit 210.

As described above, data representing the MSBs is formed of either literals, in which the MSBs are simply transmitted without further encoding, or compressed data words giving a start position and string length within the history buffer.

In the case of a literal, the control circuit simply forwards the (8-n) MSBs to the output multiplexer 240 and forces an output of the corresponding LSBs from the LSB buffer 230 to the output multiplexer 240. The output multiplexer combines the MSBs and LSBs to form an output data word. The output data word is also written to the history buffer 220 and, once the history buffer is full, the least recently added entry is discarded from the history buffer.

In the case of a compressed data word representing the MSBs, the control circuit 210 decodes the start position and string length from the compressed data word and accesses the appropriate entries in the history buffer 220. The (8-n) MSBs of the accessed entries in the history buffer are passed back to the control circuit which forwards them to the output multiplexer 240. At the same time, the control circuit forces an output from the LSB buffer 230 and the output data words are thus formed by the output multiplexer 240 concatenating the MSBs and LSBs. The history buffer is also updated with each output data word.
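The two decoding cases can be sketched at token level as follows. The token layout, ("literal", word) or ("copy", position, length, lsbs), is a hypothetical stand-in for the bit-level syntax; the sketch also assumes that positions count from the oldest history entry and that the history is updated once per token.

```python
from collections import deque


def szip_decode(tokens, n, history_size=512):
    """Rebuild the data words from literal and copy tokens."""
    history = deque(maxlen=history_size)   # mirrors the encoder's buffer
    out = []
    for token in tokens:
        if token[0] == "literal":
            chunk = [token[1]]             # MSBs and LSBs arrive as sent
        else:
            _, pos, length, lsbs = token
            hist = list(history)
            # Take the (8-n) MSBs from the history and append the sent LSBs.
            chunk = [((hist[pos + k] >> n) << n) | lsbs[k]
                     for k in range(length)]
        history.extend(chunk)              # whole output words are stored
        out.extend(chunk)
    return out
```

Because the LSBs are always transmitted as sent, the output words are exact even though only the MSBs were matched.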

In a similar manner to the history buffer of the encoder of Figure 2, the history buffer 220 could be arranged to store only the (8-n) MSBs of each decoded data word, in which case the update signal to the history buffer could be obtained from the output of the control circuit 210. In the present embodiment, however, for maximum flexibility and the other reasons discussed earlier, the history buffer 220 is updated with the entire output data words generated by the output multiplexer 240.

"DZIP" Encoder

Figure 4 is a schematic diagram of another embodiment of a data compression encoder, to be referred to for convenience as a "DZIP" encoder. The encoder comprises a difference detector 300, a pair of comparators 310, 320, calculation logic 330, a history buffer 340, a buffer 350 and an output circuit 360. The difference detector, comparators and buffer are shown grouped together as a comparison unit 370.

The history buffer 340 is similar to the buffer used in the ALDC algorithm and the buffer 130 of Figure 2 and may contain, for example 512, 1024 or 2048 data words, although the skilled man will appreciate that the choice of size of the history buffer is a matter of routine. The history buffer is updated by adding the current input data word, and, once the history buffer is full, discarding the least recently stored data word, each time a current data word has been processed.

From the value of n currently in use, the logic 330 derives the values of two further variables u and v, either by direct calculation or from a look-up table. The variables u and v are set to zero if n = 0. Otherwise, they are defined as follows:

u = 2^{n-1}

v = 2^{n-1} - 1

These definitions can be swapped round if desired.

As in Figure 2, the basic principle underlying this embodiment is that the history buffer is searched by the comparison unit 370 for strings of data words matching the incoming data stream. The encoder identifies where the longest match has occurred. If the length of this match is greater than one data word long (although in general the minimum match length may be made greater than 2), then details defining the match in the history buffer are encoded as a compressed word. The compressed word encodes the length of the match and its position within the buffer. If no matches occur, or if the match is only one data word long, then the input data words are encoded as a "literal", that is to say, raw data. In carrying out this matching process, however, an input data word x is defined as being identical to a symbol y in the history buffer 340 if:

x - v ≤ y ≤ x + u

that is to say, if the numerical difference between the binary values y and x is within certain limits.

In order to assess whether this inequality is true, the difference detector 300 first derives the numerical difference between the current input data word and a current word to be examined from the history buffer. (This process is repeated or carried out in parallel for other entries in the history buffer, but for clarity the processing will be described with reference to a single entry). This generates a value (y-x) which is passed to the two comparators 310, 320.

The comparators compare the value (y-x) with u and v, to detect whether the following two conditions apply:

Is (y-x) less than or equal to u? (comparator 310)

Is (y-x) greater than or equal to -v? (comparator 320)

If the two conditions both apply then the current input data word is the same or sufficiently similar to the history buffer entry to be considered a "match".
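The two comparator tests can be sketched as below. The limits are taken here as u = 2^(n-1) and v = 2^(n-1) - 1 for n greater than zero, an assumption chosen so that n bits suffice to number every permitted difference; the function names are hypothetical.

```python
def range_limits(n: int) -> tuple[int, int]:
    """Return (u, v) for the control parameter n (assumed definitions)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0, 0                  # exact matching only
    u = 1 << (n - 1)                 # upper limit on (y - x)
    v = u - 1                        # lower limit is -v
    return u, v


def is_match(x: int, y: int, n: int) -> bool:
    """True if history entry y is close enough to input word x."""
    u, v = range_limits(n)
    difference = y - x               # output of the difference detector
    below_upper = difference <= u    # first comparator test
    above_lower = difference >= -v   # second comparator test
    return below_upper and above_lower
```

With n = 2, a history entry matches an input word of 10 when it lies between 9 and 12 inclusive, giving four permitted differences.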

The buffer 350 stores details of entries considered to be a match, along with the difference values (y-x) obtained in respect of those entries.

Of course, the polarity of the operation of the difference detectors and comparators, or indeed the specific arrangement shown in Figure 4, can be altered without affecting the underlying principle that an input data word is accepted as a match with a history buffer entry if it is within a certain numerical range of that history buffer entry. It is preferred that the range is defined at least in part by a control parameter (e.g. the value n in this case) although the ranges could be fixed and predetermined.

So, for example, a simple equivalent is that the difference detector could output the value (x-y) and the comparator polarities be reversed. The dependence of u and v on the control parameter n could be practically any analytical, linear, non-linear or other relationship, possibly being defined by a look-up table. The range of values within which a match is obtained could be one-sided with respect to the history buffer entry or extend above and below the buffer entry as in this example.

Strings of "matches" (i.e. the same value or values close to one another as discussed above) are built up in much the same way as the ALDC encoder or the encoder of Figure 2. The output circuit 360 identifies where the longest match has occurred. Compression is therefore achieved by encoding several data words within a single compressed word. Markers are required to inform a subsequent decoder whether a compressed word or a literal word has been encoded.

Where a compressed word is transmitted, the decoder (to be described in greater detail below) accesses a string of data words at the corresponding position in a history buffer maintained at the decoder. The decoder adds each decoded data word to its history buffer as soon as it is decoded, and so in this way maintains an identical history buffer to that used at the encoder.

So, apart from the fact that a match is defined in a fundamentally different way, which in turn leads to a different way of encoding the compressed output data, the processing applied to each input data word is in some respects similar to the ALDC algorithm. Indeed, if the variable n were set to zero, the encoder of Figure 4 would operate as an ALDC encoder. Accordingly, architectures which have been proposed for the ALDC algorithm, such as an architecture defined in US-A-5 652 878, may be used or adapted for the output circuit and history buffer.

The output of the output circuit 360 is therefore a literal, being the input data word in the case that a match of two or more words has not been found, or a compressed word where a match of at least two words has been found. The syntax of the output data stream will be described below.

In common with the ALDC architecture and the encoder of Figure 2, the length of match in the history buffer is transmitted as a Huffman code of 2, 4, 6, 8 or 12 bits using the coding values set out in the following table.

Contents of Length Code Field    Match Count in Bytes
00                               2
01                               3
1000                             4
...                              ...
1011                             7
110000                           8
...                              ...
110111                           15
1110 0000                        16
...                              ...
1110 1111                        31
1111 0000 0000                   32
...                              ...
1111 1110 1110                   270
1111 1110 1111                   271

It will be seen from the above table that the maximum match count available with this code is 271. Accordingly, the compare processor is arranged to force an output if the length of match reaches 271 data words. Of course, other coding schemes, different length codes or entirely different codes could be used.

If the minimum number of matches (see above) is made greater than 2, the 270 distinct codes provided in the above table can instead be mapped to a different range of match lengths or counts. So, for example, if the minimum match length were 7, the above codes could be mapped to match lengths of 7 to 276.

Literal: compressed data 0<b><b><b><b><b><b><b><b>

where each <b> is one bit of the eight bit input word, taking the value 0 or 1.

Copy Pointer: compressed data 1<length code><displacement><differences>

where <length code> = Huffman code from above table.

<differences> ... , i.e. a binary number defining the numerical difference between each input data word and the corresponding entry in the string to be accessed in the history buffer. The number of bits is selected to be sufficient to encode all possible differences, which in turn depends on the value of n. In the example above, each difference value requires n bits to encode. The encoding of the positive and negative differences makes use of an encoding arrangement described below.

Encoding Difference Values

The difference values are encoded so that the numerically lowest difference value is coded as 0 and the highest difference value as 2^n - 1, all using n bits. So, for example, in DZIP(2), the possible difference values are -1, 0, 1, 2. These are encoded as:

-1    00
 0    01
 1    10
 2    11

"DZIP" Decoder

Figure 5 is a schematic diagram of a DZIP decoder complementary to the encoder of Figure 4.

The decoder comprises an input demultiplexer 400, a control circuit 410, a history buffer 420, a difference data detector 430 and an output adder 440.

The input demultiplexer 400 is responsive to control flags within the input compressed data stream (as detected by the control circuit 410), to separate data defining matches and literals from difference data of the input data stream. Difference data is routed to the difference data detector 430. Other data defining matches or literals is routed to the control circuit 410. As described above, data representing the original data words is formed of either literals, in which the data words are simply transmitted without further encoding, or compressed data words giving a start position and string length within the history buffer.

In the case of a literal, the control circuit 410 simply forwards the data word to the output adder 440. The output of the difference data detector 430 is zero, since no difference data is provided with a literal. The output adder 440 thus adds zero to the data word and so outputs that data word. The output data word is also written to the history buffer 420 and the least recently added entry is discarded from the history buffer.

In the case of a compressed data word, the control circuit 410 decodes the start position and string length from the compressed data word and accesses the appropriate entries in the history buffer 420. The accessed entries in the history buffer are passed back to the control circuit which forwards them to the output adder 440. At the same time, the difference values corresponding to those words are derived from the input data word's difference data and are passed to the output adder. The history buffer entries and difference values are then added by the output adder 440 to form respective output data words. The history buffer is also updated with each output data word.
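A token-level sketch of this decoding follows. It assumes that each n-bit difference code is offset so that code 0 stands for the lowest difference, -(2^(n-1) - 1), and that the decoded difference is the amount the output adder adds to the history entry; which polarity the encoder uses to produce that amount is left open here. The token layout and function names are hypothetical.

```python
from collections import deque


def decode_difference(code: int, n: int) -> int:
    """Map an n-bit difference code back to a signed difference."""
    if n == 0:
        return 0                           # no difference data is sent
    lowest = -((1 << (n - 1)) - 1)         # assumed value represented by code 0
    return code + lowest


def dzip_decode(tokens, n, history_size=512):
    """Rebuild words from ("literal", word) or ("copy", pos, length, codes)."""
    history = deque(maxlen=history_size)
    out = []
    for token in tokens:
        if token[0] == "literal":
            chunk = [token[1]]             # adder adds zero to a literal
        else:
            _, pos, length, codes = token
            hist = list(history)
            # Output adder: history entry plus decoded difference.
            chunk = [hist[pos + k] + decode_difference(codes[k], n)
                     for k in range(length)]
        history.extend(chunk)
        out.extend(chunk)
    return out
```

For n = 2 the codes 0 to 3 decode to the differences -1, 0, 1 and 2.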

As the parameter n is varied, the minimum run length considered to be a string match is preferably also varied. This is because larger values of n result in the need for larger numbers of bits to encode the LSBs in an SZIP data stream, or the differences in a DZIP data stream. So, the run length at which it becomes beneficial (in terms of quantity of output data) to encode a copy pointer rather than a series of literals also changes.
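The break-even point can be illustrated by counting bits. The sketch below assumes a literal costs 9 bits (a flag plus the eight bit word) and a copy pointer costs a fixed 12 bits (taken here as a flag, the shortest 2 bit length code and a 9 bit displacement for a 512-word history buffer) plus n bits per matched word. These costs are an interpretation for illustration, and the longer length codes used for longer runs are ignored.

```python
def min_run_length(n: int, literal_bits: int = 9,
                   pointer_overhead_bits: int = 12) -> int:
    """Smallest run length for which a copy pointer beats literals."""
    if not 0 <= n < literal_bits:
        raise ValueError("n must be smaller than the cost of a literal")
    length = 1
    # Grow the run until literals cost strictly more than the copy pointer.
    while literal_bits * length <= pointer_overhead_bits + n * length:
        length += 1
    return length


if __name__ == "__main__":
    for n in range(8):
        print(n, min_run_length(n))
```

Under these assumptions the minimum run length rises from 2 words at n = 0 to 7 words at n = 7.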

Using the parameters and the example given above, the preferred minimum run length L for SZIP and DZIP is the smallest integer such that:

L > 12 / (9 - n)

The values for each n are given below:

n    L
0    2
1    2
2    2
3    3
4    3
5    4
6    5
7    7

Adaptive "SZIP" Encoder

Figure 6 schematically illustrates an adaptive SZIP encoder.

The encoder of Figure 6 comprises an input demultiplexer 700, an array of bit change detectors 710, an array of counters 720, an array of comparators 730, an analyser 740, an optional delay unit 750 and an SZIP (n) encoder 760. The SZIP (n) encoder 760 may be of the form shown in Figure 2, and carries out SZIP encoding of input data according to a parameter n, where n specifies the number of LSBs not to take part in the comparison process with the contents of the history buffer. The output of the SZIP (n) encoder 760 is compressed data as described in detail above.

The basic operation of the apparatus of Figure 6 is that a value of n is selected which is appropriate to input data currently being received. If the delay unit 750 is incorporated so as to delay the input data while this derivation of the most appropriate value of n is taking place, then that value of n can be applied to the input data from which it was derived, for example on a block-by-block basis. Alternatively, if the delay unit 750 is omitted (but the bypass data path to the SZIP (n) encoder retained), then a value of n derived in respect of a preceding block is applied to the compression of data of a current block. This latter arrangement need not be a disadvantage because the nature of data to be compressed does not tend to change rapidly and frequently from, for example, computer data to image data and vice versa. In fact the absence of a buffer avoids an overall processing delay through the system beyond that applied by the SZIP (n) encoder 760.

In operation, the input demultiplexer 700 splits each input data word into its constituent bits, in this example eight bits from an LSB through to an MSB. Each bit is supplied to a respective change detector 710. Figure 7 schematically illustrates the processing applied to each bit and shows the change detector 710 in more detail. The change detector comprises a one-bit delay 712 and a comparator 714 arranged so that a current bit value is compared with the immediately preceding bit value. This arrangement generates one of two outputs, namely a first output indicating that the current bit and the delayed bit are identical, or a second output indicating that they are not the same.

For each bit, the output of the change detector is supplied to a counter 720, again shown in more detail in Figure 7. The counter 720 counts the number of times that the current bit and preceding bit are detected to be identical. A block reset signal can be used to clear the counter, and this signal is generated by the analyser 740 when a value of n is output to the SZIP (n) encoder 760.
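A minimal software sketch of the change detection and counting just described is given here. It illustrates the idea only and is not the hardware of Figure 7. The function names are assumptions, as is the treatment of the first word of a block (it has no predecessor in the block and is skipped). The helper `select_n` compares each count with a per-bit threshold, in the manner of the comparators 730 and the analyser 740, and returns the smallest n that excludes every bit position found to be "noisy".

```python
from typing import List, Sequence


def count_unchanged_bits(words: Sequence[int], width: int = 8) -> List[int]:
    """Count, per bit position, how often a bit equals the same bit of the
    immediately preceding word. Index 0 is the LSB."""
    counts = [0] * width
    for prev, cur in zip(words, words[1:]):
        same = ~(prev ^ cur)  # bits are set where the two words agree
        for bit in range(width):
            counts[bit] += (same >> bit) & 1
    return counts


def select_n(counts: Sequence[int], thresholds: Sequence[int]) -> int:
    """Return n such that every bit position whose count is below its
    threshold lies within the n least significant bits (0 if none)."""
    n = 0
    for bit, (count, threshold) in enumerate(zip(counts, thresholds)):
        if count < threshold:
            n = bit + 1  # exclude this bit and all less significant bits
    return n
```

In this sketch a single noisy high-order bit would force all lower bits to be excluded as well, because n counts LSBs; whether that is desirable depends on the data and is a design choice of the illustration.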

When an appropriate number of bits have been tested by the change detector 710 and the counter 720, the analyser 740 (via a "sync" control shown in Figure 7) causes the count values for each bit to be compared with a respective threshold by the comparator 730. The thresholds can be different for each bit position from MSB to LSB. For a block of, say, 2048 input data words tested in this way, a set of thresholds might be as follows:

Bit:        LSB   1     2     3     4     5     6     MSB
Threshold:  1372  1372  1372  1372  1372  1372  1372  1372

Of course, different threshold values for each bit can be used if desired.

The intention behind this arrangement is to detect which bits of the input data words tend to be identical from data word to data word and which behave "noisily" by varying a lot between data words. So, if the number of times that a bit is detected to be identical to the preceding bit (the count value from the counter 720) is lower than the respective threshold value, that bit position is determined to be a "noisy" bit position and the value of n can be selected so as to exclude that bit position from the matching process.

To illustrate this point, the following are example count values:

Bit:    LSB   1     2     3     4     5     6     MSB
Count:  1010  1100  1212  1400  1750  1600  2000  2040

The first three LSBs are detected by this process to be "noisy", in that the number of times that they are identical from bit to bit within a test block is lower than a respective threshold value. Accordingly, n is set to three by the analyser 740 so as to exclude those bits from the matching process with the history buffer.

The arrangement described above therefore provides an adaptive encoder in which a value of n can be selected, on a block-by-block basis, so as to be appropriate to the input data in use. It is noted that the process does not analyse the success of the compression process carried out by the SZIP (n) encoder, but instead derives the appropriate value of n by analysis of the input data itself. If the optional delay unit 750 is used, then that value of n can be applied to the actual data from which it was derived. If the delay unit is not used, there is a one or more block lag between the generation and use of the appropriate value of n, but this need not be a problem, especially where the block size is set to be much lower than the typical rate of variation within data files.

The value of n which is used in the encoding of data by the adaptive encoder described above needs to be added to the data stream so that information can be extracted by the decoder for use in decoding the compressed data. The value of n is encoded into the data stream using so-called marker codes defined by the ALDC standard.

In ALDC there are 16 marker codes, 1 1111 1111 0000 to 1 1111 1111 1111. The last code in this range signifies an end of file (EOF). The others are classified in the standard as reserved, but are used here as follows to signify a new value of the parameter n:

1 1111 1111 0000 signifies n=0
1 1111 1111 0001 signifies n=1
and so on.

This allows values of n up to 15 to be encoded directly. Higher values of n can be encoded by concatenating two marker codes, so that a first marker code indicates a change of numerical range (e.g. to 15-30) for subsequent marker codes.

The selection of a new value of n (if indeed it is different to the current value) is arranged to force an output by the encoder, which might otherwise have been part way through a matching process. However, the system could instead be arranged to wait until the next literal is output.
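The mapping between values of n and marker codes can be sketched as follows. This is an illustration under stated assumptions: codes are represented as 13-character bit strings, the function names are hypothetical, and because the last code of the range is stated to signify EOF, the sketch maps only n = 0 to 14 onto single marker codes. The concatenation of two marker codes for higher values is not modelled.

```python
from typing import Optional

MARKER_PREFIX = "1" + "1" * 8   # the fixed leading bits "1 1111 1111"
EOF_MARKER = MARKER_PREFIX + "1111"  # last code in the range: end of file


def marker_for_n(n: int) -> str:
    """Return the 13-bit marker code signalling a new value of n."""
    if not 0 <= n <= 14:
        raise ValueError("only n = 0..14 map to a single marker in this sketch")
    return MARKER_PREFIX + format(n, "04b")


def n_from_marker(code: str) -> Optional[int]:
    """Return n if code is an n-marker, or None for EOF or any other code."""
    if len(code) != 13 or set(code) - {"0", "1"}:
        return None
    if not code.startswith(MARKER_PREFIX) or code == EOF_MARKER:
        return None
    return int(code[9:], 2)
```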

Adaptive "DZIP" Encoder

Figure 8 schematically illustrates an adaptive DZIP encoder.

The encoder of Figure 8 comprises a subtractor 800, a one-word delay 810, an absolute value detector 820, a statistical analyser 830, a control circuit 840, an optional delay unit 850 and a DZIP (n) encoder 860. The DZIP (n) encoder 860 may be of the form shown in Figure 4, and carries out DZIP encoding of input data according to a parameter n, where n specifies (indirectly) the numerical range or tolerance within which a history buffer entry is considered to be a match to an input data word. The output of the DZIP (n) encoder 860 is compressed data as described in detail above.

The basic operation of the apparatus of Figure 8 is that a value of n is selected which is appropriate to input data currently being received. If the delay unit 850 is incorporated so as to delay the input data while this derivation of the most appropriate value of n is taking place, then that value of n can be applied to the input data from which it was derived, for example on a block-by-block basis. Alternatively, if the delay unit 850 is omitted (but the bypass path to the DZIP (n) encoder retained), then a value of n derived in respect of a preceding block is applied to the compression of data of a current block. This latter arrangement need not be a disadvantage because the nature of data to be compressed does not tend to change rapidly and frequently from, for example, computer data to image data and vice versa. In fact it avoids an overall processing delay through the system beyond that applied by the DZIP (n) encoder 860.

In operation, the subtractor 800 and one-word delay 810 detect the numerical difference between each input data word and the immediately preceding data word. The absolute value of this difference is obtained by the absolute value detector 820 and is passed to the statistical analyser 830.

When an appropriate number of data word difference values have been received by the statistical analyser 830, the statistical analyser 830 derives the mean and variance of the absolute difference values and passes these to the control circuit 840. The statistical analyser can be arranged to do this in response to a "block reset" signal, which may be generated simply by a resettable counter (not shown) or by a block synchronising signal elsewhere in the system. The block reset signal is also arranged to cause the control circuit to calculate a new value of n on the basis of the statistical information received from the statistical analyser 830.
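The statistics gathered for one block can be sketched in software as follows. This is an illustrative sketch rather than the circuit of Figure 8: the function name is hypothetical, the first word of the block is assumed to have no predecessor and so contributes no difference, and the variance is taken to be the population variance, which the description does not specify.

```python
from typing import Sequence, Tuple


def abs_difference_stats(words: Sequence[int]) -> Tuple[float, float]:
    """Return (mean, variance) of the absolute differences between each
    word and the immediately preceding word in the block."""
    diffs = [abs(cur - prev) for prev, cur in zip(words, words[1:])]
    if not diffs:
        raise ValueError("at least two words are needed")
    mean = sum(diffs) / len(diffs)
    # Population variance (an assumption; the text only says "variance").
    variance = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return mean, variance
```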

The control circuit 840 is partly illustrated schematically in more detail in Figure 9. As mentioned above, at the end of statistical analysis of a block or group of input data words, the control circuit receives the mean value and the variance value for the absolute difference values obtained between adjacent data words in that group. The mean and variance are each compared to respective threshold values (by a set of comparators 870) corresponding to values of n from 0 to 7. Figure 9 schematically illustrates this process as applied to the mean absolute difference value. A selector 880 selects a new value of n in accordance with the following rules:

Let "np" signify the new parameter n. "np" is initialised to the current value of n being used. Then, various comparisons are made with the mean and variance:

IF (variance < 1 OR mean < 1) then np = 0
ELSE IF (mean ≤ 1.5) then np = 2
ELSE IF (mean ≤ 4) then np = 3
ELSE IF (mean ≤ 8) then np = 4
ELSE IF (variance > 100 AND mean ≤ 20) then np = 5
ELSE np = 3

The intention behind this arrangement is to detect the general degree of variation within the input data so that a value of n can be selected which is likely to give a predominance of matches, but without wasting unnecessary data on difference data.

To illustrate the operation of the encoder of Figure 8, example mean and variance values for the absolute differences between adjacent data words are as follows:

Mean = 3.2
Variance = 10.3

From the thresholds in the above scheme, it can be seen that np = 3.

The arrangement described above therefore provides an adaptive encoder in which a value of n can be selected, on a block-by-block basis, so as to be appropriate to the input data in use. It is noted that the process does not analyse the success of the compression process carried out by the DZIP (n) encoder, but instead derives the appropriate value of n by analysis of the input data itself. If the optional delay unit 850 is used, then that value of n can be applied to the actual data from which it was derived. If the delay unit is not used, there is a one or more block lag between the generation and use of the appropriate value of n, but this need not be a problem, especially where the block size is set to be much lower than the typical rate of variation within data files. Also, not including the delay unit can reduce the overall processing delay through the system.
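The selection rules applied by the control circuit can be written out as a short function, shown here as a sketch. The function name and signature are hypothetical. The condition in the fifth rule is read as "mean ≤ 20", which is an assumption because that comparison is garbled in the source text. As in the description, np starts from the current value of n, although every branch of the rules then assigns it.

```python
def select_np(mean: float, variance: float, current_n: int) -> int:
    """Select the new parameter np from the mean and variance of the
    absolute differences between adjacent data words."""
    np_value = current_n  # initialised to the value of n currently in use
    if variance < 1 or mean < 1:
        np_value = 0
    elif mean <= 1.5:
        np_value = 2
    elif mean <= 4:
        np_value = 3
    elif mean <= 8:
        np_value = 4
    elif variance > 100 and mean <= 20:  # assumed reading of the garbled rule
        np_value = 5
    else:
        np_value = 3
    return np_value
```

Calling `select_np(3.2, 10.3, current_n=0)` falls into the "mean ≤ 4" branch and yields 3, in line with the worked example.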

The value of n which is used in the encoding of data by the adaptive encoder described above needs to be added to the data stream so that information can be extracted by the decoder for use in decoding the compressed data. The value of n is encoded into the data stream using so-called marker codes defined by the ALDC standard.

In ALDC there are 16 marker codes, 1 1111 1111 0000 to 1 1111 1111 1111. The last code in this range signifies an end of file (EOF). The others are classified in the standard as reserved, but are used here as follows to signify a new value of the parameter n:

1 1111 1111 0000 signifies n=0
1 1111 1111 0001 signifies n=1
and so on.

This allows values of n up to 15 to be encoded directly. Higher values of n can be encoded by concatenating two marker codes, so that a first marker code indicates a change of numerical range (e.g. to 15-30) for subsequent marker codes.

The selection of a new value of n (if indeed it is different to the current value) is arranged to force an output by the encoder, which might otherwise have been part way through a matching process. However, the system could instead be arranged to wait until the next literal is output.

Adaptive Decoders

Figure 10 schematically illustrates an adaptive SZIP or DZIP decoder complementary to the respective encoder of Figure 6 or Figure 8.

Figure 10 shows an "n" value detector 900 and an SZIP (n) or DZIP (n) decoder 910. The decoder 910 may be a decoder as shown in Figure 3 or Figure 5.

Operation of this adaptive decoder arrangement is straightforward, in that the "n" value detector 900 detects encoded values of n from the compressed data stream by searching for appropriate codes such as those defined above. When a new value of n is detected, this is passed to the decoder 910 for use in decoding subsequent compressed data. Parts of the compressed data stream which do not correspond to the encoding of a new value of n are passed directly by the "n" value detector 900 to the decoder 910 for decoding.
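The routing performed by the "n" value detector can be sketched as follows. This is a simplified illustration: it assumes the compressed stream has already been split into individual codes held as bit strings (a real decoder would have to parse the bit stream itself), the function name and the `initial_n` default are hypothetical, and the EOF code is simply passed on to the decoder.

```python
from typing import Iterable, List, Tuple

MARKER_PREFIX = "1" + "1" * 8        # fixed leading bits of every marker code
EOF_MARKER = MARKER_PREFIX + "1111"  # end of file; not an n-marker


def route_codes(codes: Iterable[str], initial_n: int = 0) -> List[Tuple[str, int]]:
    """Strip n-markers from the stream and pair every remaining code with
    the value of n in force when it is handed to the decoder."""
    n = initial_n
    routed = []
    for code in codes:
        is_marker = len(code) == 13 and code.startswith(MARKER_PREFIX)
        if is_marker and code != EOF_MARKER:
            n = int(code[9:], 2)      # new value of n for subsequent codes
        else:
            routed.append((code, n))  # passed on to the decoder
    return routed
```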

The skilled man will appreciate that the embodiments in this description may be implemented as hardware, programmable or custom hardware such as an ASIC or FPGA, a mixture of hardware and software or purely by software running on a known data processing apparatus. Where the implementation may involve software, it will be appreciated that the software and a storage medium holding some or all of that software are also considered to be embodiments of the present invention.

Claims

1. Data compression apparatus for compressing input data words into an output compressed data stream, the apparatus comprising:

a data memory for storing at least a part of each of a plurality of most-recently received input data words;

comparing logic for comparing each input data word with data words stored in the data memory to detect whether a match condition exists, a match condition being defined by an input data word being similar to a data word stored in the data memory to within a degree of similarity defined by a compression parameter;

an encoder for encoding an output data stream portion in respect of an input data word for which a match condition exists by reference to a data word stored in the data memory and difference data defining a difference between that input data word and that word in the data memory; and

an analyser for analysing properties of the input data words to determine a compression parameter to be used by the comparing logic.

2. Apparatus according to claim 1, in which the analyser is arranged to generate a compression parameter in respect of each successive one of a plurality of groups of received input data words.

3. Apparatus according to claim 2, comprising a delay arranged to delay received data words during at least part of the processing undertaken by the analyser in respect of a group of received input data words.

4. Apparatus according to any one of the preceding claims, comprising a compression parameter encoder for encoding a compression parameter into the output compressed data stream.

5. Apparatus according to any one of the preceding claims, in which:

each input data word is an m-bit data word;

the compression parameter defines an integer n where 1 ≤ n < m;

the comparing logic is operable to compare (m-n) predetermined bit positions of each input data word with corresponding bit positions of data words stored in the data memory to detect whether a match condition exists; and

the difference data represents the n other bit positions of the input data words.

6. Apparatus according to claim 5, in which the analyser comprises:

a difference detector for detecting bit differences at one or more bit positions between an input data word and a preceding input data word;

a counter for generating one or more count values dependent on occurrences of detected bit differences between input data words; and

a comparator for comparing the count value(s) with respective threshold(s).

7. Apparatus according to claim 6, in which:

the (m-n) predetermined bit positions are (m-n) most significant bits; and the analyser is operable to set n to be at least as high as the most significant bit position for which a comparison of the count value and the respective threshold indicates that more than a required maximum number of bit differences have occurred at that bit position.

8. Apparatus according to any one of claims 1 to 4, in which:

the comparing logic is operable to indicate that a match condition exists between an input data word and a data word in the data memory if the numerical values of the two data words are within a non-zero numerical range defined by the compression parameter; and the difference data defines the numerical difference between the input data word and the matching data word in the data memory.

9. Apparatus according to claim 8, in which the numerical range for a match condition is defined by:

x - v ≤ y ≤ x + u

where:

x is the numerical value of an input data word;

y is the numerical value of a data word in the data memory; and

at least one of u and v is greater than zero.

10. Apparatus according to claim 9, in which either:

u = 2^(n-1)
v = 2^(n-1) - 1

or:

v = 2^(n-1)
u = 2^(n-1) - 1

11. Apparatus according to any one of claims 8 to 10, in which the analyser comprises:

a difference detector arranged to detect the numerical difference between each input data word and a preceding input data word; and

a statistics detector arranged to detect the statistical distribution of the numerical differences detected by the difference detector.

12. Apparatus according to claim 11, in which the analyser is operable to set n in response to a comparison of the mean and variance of the input data with predetermined thresholds.

13. Apparatus according to any one of the preceding claims, in which the encoder is operable to encode a sequence of at least x input data words for which match conditions exist with a corresponding sequence of data words in the data memory by reference to the position of the matching sequence in the data memory, where the parameter x depends upon the compression parameter used by the comparing logic.

14. Data decompression apparatus for decompressing an input compressed data stream into output data words, the apparatus comprising:

a data memory for storing at least a part of a plurality of most-recently decompressed input data words; logic for detecting whether a currently received data portion of the input compressed data stream represents:

(a) a reference to a previously compressed data word and difference data defining a difference between a currently compressed data word and that previously compressed data word;

(b) data representing a compression parameter defining a degree of similarity, between a data word to be compressed and a previously compressed data word, required for that data word to have been compressed by reference to the previously compressed data word; or

(c) data defining a data word for output; and

an output circuit responsive to a compression parameter defined by a data portion of type (b) and operable either:

in the case of the data portion being of type (a), to retrieve the previously compressed data word from the data memory and to combine it with the difference data to generate an output data word; or

in the case of the data portion being of type (c), to output the data word defined by the data portion.

15. A method of data compression for compressing input data words into an output compressed data stream, the method comprising the steps of:

storing in a data memory at least a part of each of a plurality of most-recently received input data words;

comparing each input data word with data words stored in the data memory to detect whether a match condition exists, a match condition being defined by an input data word being similar to a data word stored in the data memory to within a degree of similarity defined by a compression parameter;

encoding an output data stream portion in respect of an input data word for which a match condition exists by reference to a data word stored in the data memory and difference data defining a difference between that input data word and that word in the data memory; and

analysing properties of the input data words to determine a compression parameter to be used by the comparing logic.

16. A method of data decompression for decompressing an input compressed data stream into output data words, the method comprising the steps of:

storing at least a part of a plurality of most-recently decompressed input data words;

detecting whether a currently received data portion of the input compressed data stream represents:

(a) a reference to a previously compressed data word and difference data defining a difference between a currently compressed data word and that previously compressed data word;

(b) data representing a compression parameter defining a degree of similarity, between a data word to be compressed and a previously compressed data word, required for that data word to have been compressed by reference to the previously compressed data word; or

(c) data defining a data word for output; and

in response to a compression parameter defined by a data portion of type (b), generating output data words by either:

in the case of the data portion being of type (a), retrieving the previously compressed data word from the data memory and combining it with the difference data to generate an output data word; or

in the case of the data portion being of type (c), outputting the data word defined by the data portion.

17. A compressed data stream comprising:

successive data portions at least some of which represent a reference to a previously compressed data word and difference data defining a difference between a currently compressed data word and that previously compressed data word; and

data representing a compression parameter defining a degree of similarity, between a data word to be compressed and a previously compressed data word, required for that data word to have been compressed by reference to the previously compressed data word.

18. A storage medium carrying a compressed data stream according to claim 17.

19. Data compression apparatus substantially as hereinbefore described with reference to the accompanying drawings.

20. A method of data compression, the method being substantially as hereinbefore described with reference to the accompanying drawings.

21. Data decompression apparatus substantially as hereinbefore described with reference to the accompanying drawings.

22. A method of data decompression, the method being substantially as hereinbefore described with reference to the accompanying drawings.

23. A computer program having program code for carrying out a method according to claim 15, claim 16, claim 20 or claim 22.

24. A carrier medium carrying a computer program according to claim 23.

25. A medium according to claim 24, the medium being a storage medium.