GB2360915A

GB2360915A - Run length compression encoding of selected bits of data words

Info

Publication number: GB2360915A
Application number: GB0018255A
Authority: GB
Inventors: Jason Charles Pelly; Stephen Mark Keating
Original assignee: Sony United Kingdom Ltd
Current assignee: Sony Europe Ltd
Priority date: 2000-03-30
Filing date: 2000-07-25
Publication date: 2001-10-03
Anticipated expiration: 2020-07-25
Also published as: GB0007781D0; GB0018255D0; GB2360915B

Abstract

In known run length compression encoders the whole of an input data word is compared with versions 130 of the most recently transmitted data words stored in memory, and where a run of identical words occur the position in memory of this run is transmitted, which is smaller than the raw data. Some systems of this type are based on the work of Lempel/Ziv, such as ALDC. According to the present invention only preselected bit positions are compared 120 with the corresponding bit positions of prior data words rather than the whole word. If a run of data words with identical predefined bit positions are found, then the memory location of this run is transmitted 140 in place of the raw data in those bit positions, effectively encoding those bit positions only. The data in the remaining bit positions which have not been subject to the comparison are left uncompressed 110 or compressed by some other mechanism before transmission 140, or discarded. The choice of bit position to compare/run length encode may be fixed or may vary adaptively with the data being transmitted.

Description

P/8340.GBP

DATA COMPRESSION

This invention relates to data compression.

Data compression is widely used in the storage and transmission of data such as general computer data and images.

Data compression algorithms operate on an input data stream to generate a compressed data stream, which is then decompressed to form an output data stream. They can generally be subdivided into two categories, lossy algorithms and lossless algorithms.

Lossy algorithms, such as the MPEG video compression algorithm, are commonly applied to image data and do not guarantee that the output data stream will be an exact replica of the input data stream. On the other hand, so-called lossless algorithms do recreate the input data stream exactly at decompression.

Lossless algorithms commonly used for compressing computer data include the so-called "ZIP" and "ALDC" algorithms based on the work of Lempel and Ziv originally described in the paper, "A Universal Algorithm for Sequential Data Compression", IEEE Trans Inform Theory, vol. IT-23, No 3, pp337-343, 1977. ALDC is further described in the ANSI standard for Information Technology, "Adaptive Lossless Data Compression (ALDC)", X3135/95-318A, December 1995.

In Lempel and Ziv's algorithms, multi-bit input data words are stored in a history buffer which contains (typically) the 512 most recent such words. As each data word is received, it is compared to the words already stored in the history buffer. If no matches are found, which is to say that the newly received data word is not the same as any data words stored in the history buffer, or if the match to the history buffer is only one word long, then the new data word is added to the compressed data stream as a "literal", that is, without any encoding being applied. However, if two or more successive data words match the same string of data words in the history buffer then a "copy pointer" is added to the compressed data stream. The copy pointer indicates a start position in the history buffer and the length of the matching string of data words. This algorithm therefore exploits a repetitive nature often found in computer data: if string matches are found frequently then good compression is achieved as a copy pointer is arranged to require fewer bits in the compressed data stream than the original data. However, if string matches are not found sufficiently frequently, these algorithms can actually worsen the data rate rather than compress it, as the compressed data stream then consists of data words transmitted as literals plus a certain amount of overhead and synchronisation information.

Another class of lossless algorithm which works particularly well for image data is the so-called "Rice" algorithm described in the paper, "A VLSI Chip Set for High-Speed Lossless Data Compression", J Venbrux et al, IEEE Trans on Circuits & Systems for Video Tech, vol. 2, No 4, 1992.

The Rice algorithm requires input data to be passed through a data modeller such as a differential pulse code modulation (DPCM) modeller before encoding. The modeller is designed to weight the distribution of the data towards a predominance of zero-valued data words. Coding then consists of adding the n least significant bits (LSBs) directly to the compressed data stream and performing fundamental sequence encoding on the remaining most significant bits (MSBs). While the Rice algorithm can work very well in particular circumstances, it is dependent upon the modelling algorithm matching the characteristics and statistical properties of the input data stream to be compressed. The Rice algorithm is therefore far from being a generic algorithm - instead, it must be well matched to the nature of the input data stream in use.
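As a rough illustration of the coding step just described, the sketch below splits a non-negative, already-modelled sample into its n LSBs and a fundamental-sequence (unary) code for the remaining MSBs. The function name `rice_split`, the unary convention (q zeros followed by a terminating one) and the choice to return the two parts separately are assumptions made for illustration; they are not taken from the cited paper.

```python
def rice_split(value, n):
    """Return (lsb_bits, fs_bits) for a non-negative sample.

    lsb_bits: the n least significant bits, passed through unchanged.
    fs_bits:  the remaining MSBs as a fundamental sequence, assumed
              here to be q zeros followed by a single one.
    """
    if value < 0 or n < 0:
        raise ValueError("value and n must be non-negative")
    lsb = value & ((1 << n) - 1)      # low n bits
    msb = value >> n                  # quotient coded in unary
    lsb_bits = format(lsb, "b").zfill(n) if n else ""
    fs_bits = "0" * msb + "1"
    return lsb_bits, fs_bits


if __name__ == "__main__":
    for sample in (0, 1, 5, 12):
        print(sample, rice_split(sample, 2))
```

Small values give short codes and large values give long ones, which is why the scheme depends on the modeller concentrating the data near zero.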

This invention provides data compression apparatus for compressing input m-bit data words into an output compressed data stream, the apparatus comprising:

a data memory for storing at least (m-n) predetermined bit positions of a plurality of most-recently-received input data words, where n is an integer defined by 1 ≤ n < m;

comparing logic for comparing the (m-n) predetermined bit positions of each input data word with corresponding bit positions of data words stored in the data memory to detect whether a match exists;

a detector, responsive to the comparing logic, for detecting whether there exists a match in respect of (m-n) predetermined bit positions of z or more consecutive input data words and corresponding consecutively received entries in the data memory, where z is an integer greater than 1; and

output logic operable to generate compressed output data, where:

(a) if the detector detects a match of the (m-n) predetermined bit positions of z or more consecutive input words and corresponding consecutively received entries in the data memory, the compressed output data comprises data indicating the positions of the matching entries in the data memory; and

(b) if the detector detects that a current input data word does not form part of a match of the (m-n) predetermined bit positions of z or more consecutive input words and corresponding consecutively received entries in the data memory, the output compressed data comprises data defining the m bits of the current input data word.

The invention represents an improvement over both the general ALDC algorithm and the data-type-specific Rice algorithm.

Input data words are stored in a data memory (e.g. a history buffer), but instead of requiring an exact match for strings of words stored in the buffer (as in ALDC), only predetermined bit positions, such as the most significant bits, are matched. The other bits (for example, the n least significant bits) are preferably added to the output compressed data stream in a form which does not depend on matches with the history buffer.

However, if they are not, an advantageous lossy algorithm is obtained.

This arrangement has the advantage over ALDC that with (for example) noisy image data more matches can be found, as the noise often affects mainly the least significant bits. This can in turn lead to a better degree of compression, as longer matches are encoded more efficiently than shorter matches or "literals".

The value of n can be predetermined and fixed, or can be selected by the user in response to the data type in use. This allows some adaptation to the data type while still using the same underlying algorithm.

The (m-n) predetermined bit positions are preferably the (m-n) MSBs, as in many classes of data these are the bit positions most likely to remain fairly constant between samples. However, other arrangements are possible. Consider for example a digital video signal formed by mixing an 8-bit noisy (real) source such as a camera, VTR etc with a 10 bit signal derived from a "clean" quasi noise-free source such as computer graphics, a caption generator or the like. This would result in a 10-bit video signal in which the 2 LSBs were not noisy, being derived only from the clean source, but the next few significant bits (corresponding to the LSBs of the noisy source) were noisy. In this case the technique could provide advantageous compression by ignoring one or more bits from bit 3 upwards.
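The idea of comparing only chosen bit positions, which need not be the MSBs, can be sketched with a bit mask. The helper names `make_mask` and `words_match` are hypothetical, and the particular choice of ignored bits in the usage example (bits 2 to 4, counting the LSB as bit 0, of a 10-bit word) is only one assumed reading of the mixed-video example above.

```python
def make_mask(compared_bits):
    """Build a mask with a 1 in every bit position that takes part in matching."""
    mask = 0
    for bit in compared_bits:
        mask |= 1 << bit
    return mask


def words_match(a, b, mask):
    """True if a and b agree in every bit position selected by mask."""
    return ((a ^ b) & mask) == 0


if __name__ == "__main__":
    # Assumed example: keep the two clean LSBs and the five MSBs,
    # ignore the three noisy bits in between.
    mask = make_mask([0, 1, 5, 6, 7, 8, 9])
    a = 0b1010100110
    print(words_match(a, a ^ 0b0000011100, mask))  # differs only in ignored bits
    print(words_match(a, a ^ 0b0000000001, mask))  # differs in a compared bit
```

Setting the mask to the top (m-n) bits reduces this to the MSB-only comparison described as the preferred arrangement.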

Further respective aspects and features of the invention are defined in the appended claims.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, throughout which like parts are indicated by like references, and in which:

Figure 1 is a schematic diagram of a computer network employing lossless data compression;

Figure 2 is a schematic diagram of a first embodiment of a data compression encoder, to be referred to as an "SZIP" encoder;

Figure 3 is a schematic diagram of an SZIP decoder complementary to the encoder of Figure 2;

Figure 4 is a schematic diagram of a second embodiment of a data compression encoder, to be referred to as a "DZIP" encoder;

Figure 5 is a schematic diagram of a DZIP decoder complementary to the encoder of Figure 4;

Figure 6 schematically illustrates an adaptive SZIP encoder;

Figure 7 schematically illustrates a part of the encoder of Figure 6;

Figure 8 schematically illustrates an adaptive DZIP encoder;

Figure 9 schematically illustrates a control circuit of the encoder of Figure 8; and

Figure 10 schematically illustrates an adaptive SZIP or DZIP decoder complementary to the respective encoder of Figure 6 or Figure 8.

Figure 1 is a schematic diagram of a computer network employing lossless data compression. A server computer 10 comprising a processing unit 60 and a display 65 is linked to a client computer 20 by a network connection 30. Data stored on a data storage medium 40 at the server, such as a hard disc store, a RAID store, an optical disc store or a tape store, is communicated, via a network interface card 50 in the processing unit 60 of the server, to the network connection 30. At the client, the communicated data is received by a network interface card 70 in a processing unit 80 of the client.

The network interface card 50 is arranged to perform lossless data compression and the network interface card 70 is arranged to perform a complementary lossless decompression process.

Because the compression and decompression are lossless, the data received at the output of the decompression process by the processing unit 80 are identical to the data retrieved by the server from the data store 40. However, the compression process means that data traffic on the network connection 30 is less than would be the case if the data were transmitted without compression.

"SZIP" Encoder

Figure 2 is a schematic diagram of a first embodiment of a data compression encoder, to be referred to for convenience as an "SZIP" encoder. This example of an encoder and subsequently described embodiments in the present application are shown operating on 8 bit input data words, but it will be clear that other types of input data word could be used.

Each input data word is passed to a demultiplexer 100 which, in response to a control input value "n", where n is an integer no less than zero and no more than the number of bits per word (8 in this example), splits the input data word into two subwords, one formed of n least significant bits (LSBs) and the other formed of (8-n) most significant bits (MSBs).

The n LSBs are stored in a buffer 110 for the duration of the processing applied to the remaining MSBs.

The (8-n) MSBs are passed to a compare processor 120 which compares them with corresponding MSBs of entries in a history buffer 130.
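The split performed by the demultiplexer 100 can be sketched as a shift and a mask. The function name `split_word` and its default word width are illustrative assumptions.

```python
def split_word(word, n, m=8):
    """Split an m-bit word into its (m-n) MSBs and its n LSBs.

    Returns (msbs, lsbs) as unsigned integers.
    """
    if not 0 <= n <= m:
        raise ValueError("n must lie between 0 and the word width")
    if not 0 <= word < (1 << m):
        raise ValueError("word does not fit in m bits")
    lsbs = word & ((1 << n) - 1)   # sub-word held in the LSB buffer
    msbs = word >> n               # sub-word sent to the compare processor
    return msbs, lsbs


if __name__ == "__main__":
    print(split_word(0b10110110, 3))
```

With n = 0 the whole word goes to the compare path and the LSB sub-word is empty.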

The history buffer 130 is similar to the buffer used in the ALDC algorithm and may contain, for example, 512, 1024 or 2048 data words, although the skilled man will appreciate that the choice of size of the history buffer is a matter of routine. The history buffer is updated by adding the current input data word, and discarding (once the history buffer is full) the least recently stored data word, each time a current data word has been processed. In the present embodiment, the history buffer stores all eight bits of each input data word. This has advantages in allowing the value of n to be varied during the data compression process (see the description of Figure 6 below). However, if the value of n is fixed in a predetermined manner, the history buffer could be arranged to store only the (8-n) MSBs of each input data word. Having said this, data memories of the order of 512 to 2048 bytes are cheap, even when implemented (as in these examples) on an application specific integrated circuit (ASIC) or field programmable gate array (FPGA), so the present embodiment stores all eight bits to allow a greater flexibility of design and use.

The operations carried out by the compare processor 120 involve searching the history buffer for strings of data words having (8-n) MSBs matching the corresponding (8-n) MSBs of the incoming data stream. The compare processor identifies where the longest match has occurred. If the length of this match is greater than one data word long (although in general the minimum number of matches may be made greater than 2), then details defining the match in the history buffer are encoded as a compressed word. The compressed word encodes the length of the match and its position within the buffer. If no matches occur, or if the match is only one data word long, then the 8-n MSBs of the input data word are encoded as a "literal", that is to say, raw data.
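A minimal software sketch of this search is given below. It is a brute-force scan, not the parallel hardware architecture referred to in this description, and it assumes that a match must lie wholly within words already in the history buffer. The name `longest_msb_match` and its parameters are hypothetical.

```python
def longest_msb_match(history, incoming, n, min_len=2, max_len=271):
    """Find the longest run in history whose MSBs match the start of incoming.

    history, incoming: sequences of unsigned words.
    n: number of LSBs excluded from the comparison.
    Returns (start, length), or None if the best run is shorter than min_len.
    """
    best_start, best_len = None, 0
    for start in range(len(history)):
        length = 0
        while (length < max_len
               and length < len(incoming)
               and start + length < len(history)
               and (history[start + length] >> n) == (incoming[length] >> n)):
            length += 1
        if length > best_len:            # keep the earliest longest run
            best_start, best_len = start, length
    if best_len >= min_len:
        return best_start, best_len
    return None                           # caller emits a literal instead


if __name__ == "__main__":
    history = [0x10, 0x21, 0x32, 0x43, 0x10, 0x20]
    incoming = [0x23, 0x30, 0x47, 0x99]
    print(longest_msb_match(history, incoming, n=4))
    print(longest_msb_match(history, incoming, n=0))
```

In the usage example the same data yields a three-word match when the four LSBs are ignored and no match at all when whole words are compared, which is the effect the encoder relies on for noisy data.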

Compression is therefore achieved by encoding several data words within a single compressed word. Markers are required to inform a subsequent decoder whether a compressed word or a literal word has been encoded.

Where a compressed word is transmitted, the decoder (to be described in greater detail below) accesses a string of data words at the corresponding position in a history buffer maintained at the decoder. The decoder adds each decoded data word to its history buffer as soon as it is decoded, and so in this way maintains an identical history buffer to that used at the encoder.

So, as regards the (8-n) MSBs, the processing applied to that sub-word is in fact similar to the ALDC algorithm. Indeed, if the variable n were set to zero, the encoder of Figure 2 would operate as an ALDC encoder. Accordingly, architectures which have been proposed for the ALDC algorithm, such as an architecture defined in US-A-5 652 878, may be used or adapted for the compare processor and history buffer.

The output of the compare processor 120 is therefore either a literal, being the (8-n) MSBs in the case that a match of two or more words has not been found, or a compressed word where a match of at least two words has been found. When such an output is generated by the compare processor 120, the compare processor controls the buffer 110 to output the LSBs which have been temporarily stored in the buffer. These are output in an uncompressed or "literal" form and passed, with the output of the compare processor 120, to an output multiplexer 140 which generates an output data stream. The syntax of the output data stream will be described below.

In common with the ALDC architecture, the length of match in the history buffer is transmitted as a Huffman code of 2, 4, 6, 8 or 12 bits using the coding values set out in the following table.

Contents of Length Code Field    Match Count in Bytes
00                               2
01                               3
1000                             4
1011                             7
110000                           8
110111                           15
11100000                         16
11101111                         31
1111 0000 0000                   32
1111 1110 1110                   270
1111 1110 1111                   271

It will be seen from the above table that the maximum match count available with this code is 271. Accordingly, the compare processor is arranged to force an output if the length of match reaches 271 data words. However, it will of course be appreciated that other encoding schemes using different length or entirely different codes or coding arrangements could be employed.

If the minimum number of matches (see above) is made greater than 2, the 270 distinct codes provided in the above table can instead be mapped to a different range of match lengths or counts. So, for example, if the minimum match length were 7, the above codes could be mapped to match lengths of 7 to 276.
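The length code can be sketched as a prefix that selects a band of match counts followed by a fixed-width offset within the band. The band boundaries below reproduce every row shown in the table above; the codes for counts that the table does not list are inferred from that pattern and should be treated as an assumption. The function name `length_code` and the `min_match` parameter are hypothetical.

```python
# (first count in band, prefix bits, number of offset bits)
_BANDS = [(2, "0", 1), (4, "10", 2), (8, "110", 3),
          (16, "1110", 4), (32, "1111", 8)]


def length_code(count, min_match=2):
    """Return the length-code bit string for a match of `count` words.

    min_match shifts the whole code range, so the 270 codes cover
    min_match .. min_match + 269.
    """
    shifted = count - min_match + 2      # map back onto the 2..271 range
    if not 2 <= shifted <= 271:
        raise ValueError("match count outside the range of the code")
    for base, prefix, width in reversed(_BANDS):
        if shifted >= base:
            return prefix + format(shifted - base, "b").zfill(width)


if __name__ == "__main__":
    for c in (2, 3, 4, 7, 8, 15, 16, 31, 32, 270, 271):
        print(c, length_code(c))
```

Calling `length_code(7, min_match=7)` gives the shortest code, which corresponds to the remapping to match lengths of 7 to 276 mentioned above.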

The output data syntax is as follows. Each compressed data word is either a literal or a copy pointer. The first bit of the compressed data forms a flag to indicate which of these is in use.

Literal: compressed data 0 followed by the eight bits of the input word, each taking the value 0 or 1.

Copy Pointer: compressed data 1 <length code><displacement><LSBs>

where:

<length code> is the Huffman code from the above table.

<displacement> is a binary number defining a position within the history buffer. The number of bits is selected to be sufficient to encode all possible positions within the history buffer, e.g. 11 bits for a 2048-word history buffer.

<LSBs> is of length n bits.

"SZIP" Decoder

Figure 3 is a schematic diagram of an SZIP decoder complementary to the encoder of Figure 2.

The decoder comprises an input demultiplexer 200, a control circuit 210, a history buffer 220, an LSB buffer 230 and an output multiplexer 240.

The input demultiplexer 200 is responsive to the parameter n and also to control flags within the input compressed data stream, to separate the unencoded LSBs from the encoded MSBs of the input data stream. These flags are derived by the control circuit 210 and serve to indicate to the input demultiplexer how many LSBs to extract from the input data stream.

The unencoded LSBs are supplied to an LSB buffer 230 where they are stored until an output is forced by the control circuit 210.

As described above, data representing the MSBs is formed of either literals, in which the MSBs are simply transmitted without further encoding, or compressed data words giving a start position and string length within the history buffer.

In the case of a literal, the control circuit simply forwards the (8-n) MSBs to the output multiplexer 240 and forces an output of the corresponding LSBs from the LSB buffer 230 to the output multiplexer 240. The output multiplexer combines the MSBs and LSBs to form an output data word. The output data word is also written to the history buffer 220 and, once the history buffer is full, the least recently added entry is discarded from the history buffer.

In the case of a compressed data word representing the MSBs, the control circuit 210 decodes the start position and string length from the compressed data word and accesses the appropriate entries in the history buffer 220. The (8-n) MSBs of the accessed entries in the history buffer are passed back to the control circuit which forwards them to the output multiplexer 240. At the same time, the control circuit forces an output from the LSB buffer 230 and the output data words are thus formed by the output multiplexer 240 concatenating the MSBs and LSBs. The history buffer is also updated with each output data word.
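The copy-pointer path of the decoder can be sketched as follows. This is a simplified model: it assumes one n-bit LSB group is received per decoded word, that the referenced entries are all present before decoding starts, and that the history buffer is a plain list indexed from its oldest entry. The name `szip_decode_copy` and its parameters are hypothetical.

```python
def szip_decode_copy(history, start, length, lsbs, n, max_history=512):
    """Rebuild `length` words from history MSBs and received LSBs.

    history is updated in place with the decoded words and trimmed to
    max_history entries, oldest first.
    """
    if len(lsbs) != length:
        raise ValueError("expected one LSB group per decoded word")
    # Take the (8-n) MSBs of each referenced history entry.
    msbs = [history[start + i] >> n for i in range(length)]
    # Concatenate MSBs and LSBs, as the output multiplexer does.
    words = [(m << n) | l for m, l in zip(msbs, lsbs)]
    history.extend(words)
    del history[:max(0, len(history) - max_history)]
    return words


if __name__ == "__main__":
    history = [0xA0, 0xB1, 0xC2]
    out = szip_decode_copy(history, start=1, length=2, lsbs=[0x5, 0x7], n=4)
    print([hex(w) for w in out], [hex(w) for w in history])
```

Because the decoded words carry the received LSBs rather than those of the referenced entries, the decoder's history buffer stays identical to the encoder's.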

In a similar manner to the history buffer of the encoder of Figure 2, the history buffer 220 could be arranged to store only the (8-n) MSBs of each decoded data word, in which case the update signal to the history buffer could be obtained from the output of the control circuit 210. In the present embodiment, however, for maximum flexibility and the other reasons discussed earlier, the history buffer 220 is updated with the entire output data words generated by the output multiplexer 240.

"DZIP" Encoder

Figure 4 is a schematic diagram of another embodiment of a data compression encoder, to be referred to for convenience as a "DZIP" encoder. The encoder comprises a difference detector 300, a pair of comparators 310, 320, calculation logic 330, a history buffer 340, a buffer 350 and an output circuit 360. The difference detector, comparators and buffer are shown grouped together as a comparison unit 370.

The history buffer 340 is similar to the buffer used in the ALDC algorithm and the buffer 130 of Figure 2 and may contain, for example 512, 1024 or 2048 data words, although the skilled man will appreciate that the choice of size of the history buffer is a matter of routine. The history buffer is updated by adding the current input data word, and, once the history buffer is full, discarding the least recently stored data word, each time a current data word has been processed.

From the value of n currently in use, the logic 330 derives the values of two further variables u and v, either by direct calculation or from a look-up table. The variables u and v are set to zero if n = 0. Otherwise, they are defined as follows:

u = 2^{n-1}

v = 2^{n-1} - 1

These definitions can be swapped round if desired.

As in Figure 2, the basic principle underlying this embodiment is that the history buffer is searched by the comparison unit 370 for strings of data words matching the incoming data stream. The encoder identifies where the longest match has occurred. If the length of this match is greater than one data word long (although in general the minimum match length may be made greater than 2), then details defining the match in the history buffer are encoded as a compressed word. The compressed word encodes the length of the match and its position within the buffer. If no matches occur, or if the match is only one data word long, then the input data words are encoded as a "literal", that is to say, raw data. In carrying out this matching process, however, an input data word x is defined as being identical to a symbol y in the history buffer 340 if:

x - v ≤ y ≤ x + u

that is to say, if the numerical difference between the binary values y and x is within certain limits.

In order to assess whether this inequality is true, the difference detector 300 first derives the numerical difference between the current input data word and a current word to be examined from the history buffer. (This process is repeated or carried out in parallel for other entries in the history buffer, but for clarity the processing will be described with reference to a single entry). This generates a value (y-x) which is passed to the two comparators 310, 320.

The comparators compare the value (y-x) with u and v, to detect whether the following two conditions apply:

Is (y-x) less than or equal to u? (comparator 310)

Is (y-x) greater than or equal to -v? (comparator 320)

If the two conditions both apply then the current input data word is the same or sufficiently similar to the history buffer entry to be considered a "match".
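The two comparator tests can be sketched as a single range check. The limits are taken here as u = 2^(n-1) and v = 2^(n-1) - 1 for n > 0, which is the reading of the definitions in this description that agrees with the DZIP(2) difference values of -1, 0, 1 and 2 given later; the function names are hypothetical.

```python
def dzip_limits(n):
    """Return (u, v), the upper and lower tolerance for a given n."""
    if n == 0:
        return 0, 0                      # exact matching only
    return 1 << (n - 1), (1 << (n - 1)) - 1


def dzip_is_match(x, y, n):
    """True if history entry y is within tolerance of input word x."""
    u, v = dzip_limits(n)
    diff = y - x                         # output of the difference detector
    return -v <= diff <= u               # both comparator conditions hold


if __name__ == "__main__":
    for y in range(7, 14):
        print(y, dzip_is_match(10, y, n=2))
```

For n = 2 the usage example accepts history values 9 to 12 as matches for an input of 10, giving 2^n candidate values in total.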

The buffer 350 stores details of entries considered to be a match, along with the difference values (y-x) obtained in respect of those entries.

Of course, the polarity of the operation of the difference detectors and comparators, or indeed the specific arrangement shown in Figure 4, can be altered without affecting the underlying principle that an input data word is accepted as a match with a history buffer entry if it is within a certain numerical range of that history buffer entry. It is preferred that the range is defined at least in part by a control parameter (e.g. the value n in this case) although the ranges could be fixed and predetermined.

So, for example, a simple equivalent is that the difference detector could output the value (x-y) and the comparator polarities be reversed. The dependence of u and v on the control parameter n could be practically any analytical, linear, non-linear or other relationship, possibly being defined by a look-up table. The range of values within which a match is obtained could be one-sided with respect to the history buffer entry or extend above and below the buffer entry as in this example.

Strings of "matches" (i.e. the same value or values close to one another as discussed above) are built up in much the same way as the ALDC encoder or the encoder of Figure 2. The output circuit 360 identifies where the longest match has occurred. Compression is therefore achieved by encoding several data words within a single compressed word. Markers are required to inform a subsequent decoder whether a compressed word or a literal word has been encoded.

Where a compressed word is transmitted, the decoder (to be described in greater detail below) accesses a string of data words at the corresponding position in a history buffer maintained at the decoder. The decoder adds each decoded data word to its history buffer as soon as it is decoded, and so in this way maintains an identical history buffer to that used at the encoder.

So, apart from the fact that a match is defined in a fundamentally different way, which in turn leads to a different way of encoding the compressed output data, the processing applied to each input data word is in some respects similar to the ALDC algorithm. Indeed, if the variable n were set to zero, the encoder of Figure 4 would operate as an ALDC encoder. Accordingly, architectures which have been proposed for the ALDC algorithm, such as an architecture defined in US-A-5 652 878, may be used or adapted for the output circuit and history buffer.

The output of the output circuit 360 is therefore a literal, being the input data word in the case that a match of two or more words has not been found, or a compressed word where a match of at least two words has been found. The syntax of the output data stream will be described below.

In common with the ALDC architecture and the encoder of Figure 2, the length of match in the history buffer is transmitted as a Huffman code of 2, 4, 6, 8 or 12 bits using the coding values set out in the following table.

Contents of Length Code Field    Match Count in Bytes
00                               2
01                               3
1000                             4
1011                             7
110000                           8
110111                           15
11100000                         16
11101111                         31
1111 0000 0000                   32
1111 1110 1110                   270
1111 1110 1111                   271

It will be seen from the above table that the maximum match count available with this code is 271. Accordingly, the compare processor is arranged to force an output if the length of match reaches 271 data words. Of course, other coding schemes, different length codes or entirely different codes could be used.

If the minimum number of matches (see above) is made greater than 2, the 270 distinct codes provided in the above table can instead be mapped to a different range of match lengths or counts. So, for example, if the minimum match length were 7, the above codes could be mapped to match lengths of 7 to 276.

Literal: compressed data 0 followed by the eight bits of the input word, each taking the value 0 or 1.

Copy Pointer: compressed data 1 <length code><displacement><differences>

where:

<length code> is the Huffman code from the above table.

<differences> is a binary number defining the numerical difference between each input data word and the corresponding entry in the string to be accessed in the history buffer. The number of bits is selected to be sufficient to encode all possible differences, which in turn depends on the value of n. In the example above, each difference value requires n bits to encode. The encoding of the positive and negative differences makes use of an encoding arrangement described below.

Encoding Difference Values

The difference values are encoded so that the numerically lowest difference value is coded as 0 and the highest difference value as 2^n - 1, all using n bits. So, for example, in DZIP(2), the possible difference values are -1, 0, 1, 2. These are encoded as:

-1    00
0     01
1     10
2     11

"DZIP" Decoder

Figure 5 is a schematic diagram of a DZIP decoder complementary to the encoder of Figure 4.

The decoder comprises an input demultiplexer 400, a control circuit 410, a history buffer 420, a difference data detector 430 and an output adder 440.

The input demultiplexer 400 is responsive to control flags within the input compressed data stream (as detected by the control circuit 410), to separate data defining matches and literals from difference data of the input data stream. Difference data is routed to the difference data detector 430. Other data defining matches or literals is routed to the control circuit 410. As described above, data representing the original data words is formed of either literals, in which the data word is simply transmitted without further encoding, or compressed data words giving a start position and string length within the history buffer.

In the case of a literal, the control circuit 410 simply forwards the data word to the output adder 440. The output of the difference data detector 430 is zero, since no difference data is provided with a literal. The output adder 440 thus adds zero to the data word and so outputs that data word. The output data word is also written to the history buffer 420 and the least recently added entry is discarded from the history buffer.

In the case of a compressed data word, the control circuit 410 decodes the start position and string length from the compressed data word and accesses the appropriate entries in the history buffer 420. The accessed entries in the history buffer are passed back to the control circuit which forwards them to the output adder 440. At the same time, the difference values corresponding to those words are derived from the input data word's difference data and are passed to the output adder. The history buffer entries and difference values are then added by the output adder 440 to form respective output data words. The history buffer is also updated with each output data word.
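The difference coding and the reconstruction step can be sketched together. The code follows the stated rule that the lowest difference is coded as 0 and the highest as 2^n - 1. The sign convention is an assumption: the difference is taken as (y - x), as in the encoder description, so the sketch recovers x by subtracting it from the history entry y, which is equivalent to the adder adding a difference of the opposite polarity. All function names are hypothetical.

```python
def _limits(n):
    """(u, v) tolerance limits, assumed as u = 2^(n-1), v = 2^(n-1) - 1."""
    return (0, 0) if n == 0 else (1 << (n - 1), (1 << (n - 1)) - 1)


def encode_difference(diff, n):
    """Map a difference in [-v, u] onto an unsigned code in [0, 2^n - 1]."""
    u, v = _limits(n)
    if not -v <= diff <= u:
        raise ValueError("difference outside the matching range")
    return diff + v


def decode_difference(code, n):
    """Inverse of encode_difference."""
    _, v = _limits(n)
    return code - v


def dzip_reconstruct(history_entry, code, n):
    """Recover the original word x from history entry y and a coded (y - x)."""
    return history_entry - decode_difference(code, n)


if __name__ == "__main__":
    print([encode_difference(d, 2) for d in (-1, 0, 1, 2)])
    print(dzip_reconstruct(12, encode_difference(12 - 10, 2), 2))
```

Since the exact difference is transmitted for every matched word, the reconstruction is lossless even though the match test itself is approximate.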

As the parameter n is varied, the minimum run length considered to be a string match is preferably also varied. This is because larger values of n result in the need for larger numbers of bits to encode the LSBs in an SZIP data stream, or the differences in a DZIP data stream. So, the run length at which it becomes beneficial (in terms of quantity of output data) to encode a copy pointer rather than a series of literals also changes.
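This break-even point can be sketched numerically. The constants 12 and 9 are those appearing in the preferred-run-length expression of this description; reading them as the fixed copy-pointer overhead in bits and the cost in bits of one literal is an interpretation, not something the text states. The function name `min_run_length` is hypothetical.

```python
def min_run_length(n, pointer_overhead_bits=12, literal_bits=9):
    """Smallest integer L with L > pointer_overhead_bits / (literal_bits - n).

    Each matched word still costs n bits (LSBs or a difference value),
    so a run only saves (literal_bits - n) bits per word.
    """
    saving_per_word = literal_bits - n
    if saving_per_word <= 0:
        raise ValueError("no run length breaks even for this n")
    # floor(a / b) + 1 is the smallest integer strictly greater than a / b
    return pointer_overhead_bits // saving_per_word + 1


if __name__ == "__main__":
    for n in range(8):
        print(n, min_run_length(n))
```

The minimum run length rises slowly for small n and sharply as n approaches the word width, because each matched word then saves very few bits.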

Using the parameters and the example given above, the preferred minimum run length L for SZIP and DZIP is the smallest integer such that:

L > 12 / (9 - n)

The value of L for each n is given below:

n    L
0    2
1    2
2    2
3    3
4    3
5    4
6    5
7    7

Adaptive "SZIP" Encoder

Figure 6 schematically illustrates an adaptive SZIP encoder.

The encoder of Figure 6 comprises an input demultiplexer 700, an array of bit change detectors 710, an array of counters 720, an array of comparators 730, an analyser 740, an optional delay unit 750 and an SZIP (n) encoder 760. The SZIP (n) encoder 760 may be of the form shown in Figure 2, and carries out SZIP encoding of input data according to a parameter n, where n specifies the number of LSBs not to take part in the comparison process with the contents of the history buffer. The output of the SZIP (n) encoder 760 is compressed data as described in detail above.

The basic operation of the apparatus of Figure 6 is that a value of n is selected which is appropriate to input data currently being received. If the delay unit 750 is incorporated so as to delay the input data while this derivation of the most appropriate value of n is taking place, then that value of n can be applied to the input data from which it was derived, for example on a block-by-block basis. Alternatively, if the delay unit 750 is omitted (but the bypass data path to the SZIP (n) encoder retained), then a value of n derived in respect of a preceding block is applied to the compression of data of a current block. This latter arrangement need not be a disadvantage because the nature of data to be compressed does not tend to change rapidly and frequently from, for example, computer data to image data and vice versa. In fact the absence of a buffer avoids an overall processing delay through the system beyond that applied by the SZIP (n) encoder 760.

In operation, the input demultiplexer 700 splits each input data word into its constituent bits, in this example eight bits from an LSB through to an MSB. Each bit is supplied to a respective change detector 710. Figure 7 schematically illustrates the processing applied to each bit and shows the change detector 710 in more detail. The change detector comprises a one-bit delay 712 and a comparator 714 arranged so that a current bit value is compared with the immediately preceding bit value. This arrangement generates one of two outputs, namely a first output indicating that the current bit and the delayed bit are identical, or a second output indicating that they are not the same.

For each bit, the output of the change detector is supplied to a counter 720, again shown in more detail in Figure 7. The counter 720 counts the number of times that the current bit and preceding bit are detected to be identical. A block reset signal can be used to clear the counter, and this signal is generated by the analyser 740 when a value of n is output to the SZIP (n) encoder 760.

When an appropriate number of bits have been tested by the change detector 710 and the counter 720, the analyser 740 (via a "sync" control shown in Figure 7) causes the count values for each bit to be compared with a respective threshold by the comparator 730. The thresholds can be different for each bit position from MSB to LSB. For a block of, say, 2048 input data words tested in this way, a set of thresholds might be as follows:

Bit:        LSB   1     2     3     4     5     6     MSB
Threshold:  1372  1372  1372  1372  1372  1372  1372  1372

Of course, different threshold values for each bit can be used if desired.

The intention behind this arrangement is to detect which bits of the input data words tend to be identical from data word to data word and which behave "noisily" by varying a lot between data words. So, if the number of times that a bit is detected to be identical to the preceding bit (the count value from the counter 720) is lower than the respective threshold value, that bit position is determined to be a "noisy" bit position and the value of n can be selected so as to exclude that bit position from the matching process.
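The counting and thresholding just described can be sketched in software. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that n is taken as the number of contiguous "noisy" bit positions counted up from the LSB, capped at m - 1 so that at least one bit still takes part in the matching process. Neither assumption is stated in the text.

```python
def count_unchanged_bits(block, m=8):
    """For each bit position (index 0 = LSB), count how often the bit
    equals the same bit of the immediately preceding word in the block."""
    counts = [0] * m
    for prev, cur in zip(block, block[1:]):
        changed = prev ^ cur  # set bits mark positions that differ
        for bit in range(m):
            if not (changed >> bit) & 1:
                counts[bit] += 1
    return counts


def select_n(counts, thresholds):
    """Return the number of contiguous noisy LSB positions.

    A position is treated as noisy when its count is lower than its
    threshold. Assumption: counting stops at the first non-noisy
    position, and n is capped at m - 1.
    """
    n = 0
    for count, threshold in zip(counts, thresholds):
        if count >= threshold:
            break
        n += 1
    return min(n, len(counts) - 1)
```

With a uniform threshold of 1372, hypothetical counts of 1000, 1100 and 1200 for the three lowest bit positions, followed by counts of 1400 or more, would give n = 3 under these assumptions.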

To illustrate this point, the following are example count values:

Bit:    LSB   1     2     3     4     5     6     MSB
Count:  1010  M10   1212  1400  1750  1600  2000  2040

The first three LSBs are detected by this process to be "noisy", in that the number of times that they are identical from bit to bit within a test block is lower than a respective threshold value. Accordingly, n is set to three by the analyser 740 so as to exclude those bits from the matching process with the history buffer.

The arrangement described above therefore provides an adaptive encoder in which a value of n can be selected, on a block-by-block basis, so as to be appropriate to the input data in use. It is noted that the process does not analyse the success of the compression process carried out by the SZIP (n) encoder, but instead derives the appropriate value of n by analysis of the input data itself. If the optional delay unit 750 is used, then that value of n can be applied to the actual data from which it was derived. If the delay unit is not used, there is a one or more block lag between the generation and use of the appropriate value of n, but this need not be a problem, especially where the block size is set to be much lower than the typical rate of variation within data files.

The value of n which is used in the encoding of data by the adaptive encoder described above needs to be added to the data stream so that information can be extracted by the decoder for use in decoding the compressed data. The value of n is encoded into the data stream using so-called marker codes defined by the ALDC standard.

In ALDC there are 16 marker codes, 1 1111 1111 0000 to 1 1111 1111 1111. The last code in this range signifies an end of file (EOF). The others are classified in the standard as reserved, but are used here as follows to signify a new value of the parameter n:

1 1111 1111 0000 signifies n=0
1 1111 1111 0001 signifies n=1
and so on.

This allows values of n up to 15 to be encoded directly. Higher values of n can be encoded by concatenating two marker codes, so that a first marker code indicates a change of numerical range (e.g. to 15-30) for subsequent marker codes.
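The mapping between n and the marker codes can be illustrated by the short Python sketch below. The function names are hypothetical. The sketch assumes that the end-of-file code 1 1111 1111 1111 is never reused for a value of n, so that only the fifteen codes ending 0000 to 1110 carry a value directly; the concatenation of two marker codes for higher values is not modelled.

```python
MARKER_PREFIX = 0b1_1111_1111 << 4  # 1 1111 1111 0000
EOF_CODE = MARKER_PREFIX | 0b1111   # 1 1111 1111 1111, end of file


def n_to_marker(n):
    """Return the 13-bit marker code signalling a new value of n."""
    if not 0 <= n <= 14:
        raise ValueError("n must be in 0..14 for a single marker code")
    return MARKER_PREFIX | n


def marker_to_n(code):
    """Return n for a marker code, or None for the end-of-file code."""
    if code & ~0b1111 != MARKER_PREFIX:
        raise ValueError("not a marker code")
    if code == EOF_CODE:
        return None
    return code & 0b1111


def format_marker(code):
    """Render a 13-bit code in the grouped form used in the text."""
    bits = format(code, "013b")
    return " ".join([bits[0], bits[1:5], bits[5:9], bits[9:13]])
```

With these definitions, format_marker(n_to_marker(1)) gives "1 1111 1111 0001", matching the second code listed above.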

The selection of a new value of n (if indeed it is different to the current value) is arranged to force an output by the encoder, which might otherwise have been part way through a matching process. However, the system could instead be arranged to wait until the next literal is output.

Adaptive "DZIP" Encoder

Figure 8 schematically illustrates an adaptive DZIP encoder.

The encoder of Figure 8 comprises a subtractor 800, a one-word delay 810, an absolute value detector 820, a statistical analyser 830, a control circuit 840, an optional delay unit 850 and a DZIP (n) encoder 860. The DZIP (n) encoder 860 may be of the form shown in Figure 4, and carries out DZIP encoding of input data according to a parameter n, where n specifies (indirectly) the numerical range or tolerance within which a history buffer entry is considered to be a match to an input data word. The output of the DZIP (n) encoder 860 is compressed data as described in detail above.

The basic operation of the apparatus of Figure 8 is that a value of n is selected which is appropriate to input data currently being received. If the delay unit 850 is incorporated so as to delay the input data while this derivation of the most appropriate value of n is taking place, then that value of n can be applied to the input data from which it was derived, for example on a block-by-block basis. Alternatively, if the delay unit 850 is omitted (but the bypass path to the DZIP(n) encoder retained), then a value of n derived in respect of a preceding block is applied to the compression of data of a current block.

This latter arrangement need not be a disadvantage because the nature of data to be compressed does not tend to change rapidly and frequently from, for example, computer data to image data and vice versa. In fact it avoids an overall processing delay through the system beyond that applied by the DZIP (n) encoder 860.

In operation, the subtractor 800 and one-word delay 810 detect the numerical difference between each input data word and the immediately preceding data word. The absolute value of this difference is obtained by the absolute value detector 820 and is passed to the statistical analyser 830.

When an appropriate number of data word difference values have been received by the statistical analyser 830, the statistical analyser 830 derives the mean and variance of the absolute difference values and passes these to the control circuit 840. The statistical analyser can be arranged to do this in response to a "block reset" signal, which may be generated simply by a resettable counter (not shown) or by a block synchronising signal elsewhere in the system. The block reset signal is also arranged to cause the control circuit to calculate a new value of n on the basis of the statistical information received from the statistical analyser 830.
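As a rough software analogue of the subtractor 800, the absolute value detector 820 and the statistical analyser 830, the mean and variance for one block could be derived as shown below. This is a sketch only: the function name is hypothetical, and the use of the population variance (rather than the sample variance) is an assumption, since the text does not say which is intended.

```python
from statistics import fmean, pvariance


def abs_difference_stats(block):
    """Return (mean, variance) of |x[i] - x[i-1]| over one block of words."""
    # Pair each word with its immediate predecessor, as the one-word
    # delay and subtractor would, and keep the magnitude only.
    diffs = [abs(cur - prev) for prev, cur in zip(block, block[1:])]
    if not diffs:
        raise ValueError("block must contain at least two data words")
    return fmean(diffs), pvariance(diffs)
```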

The control circuit 840 is partly illustrated schematically in more detail in Figure 9. As mentioned above, at the end of statistical analysis of a block or group of input data words, the control circuit receives the mean value and the variance value for the absolute difference values obtained between adjacent data words in that group. The mean and variance are each compared to respective threshold values (by a set of comparators 870) corresponding to values of n from 0 to 7. Figure 9 schematically illustrates this process as applied to the mean absolute difference value. A selector 880 selects a new value of n in accordance with the following rules:

Let "np" signify the new parameter n. "np" is initialised to the current value of n being used. Then, various comparisons are made with the mean and variance:

IF (variance < 1 OR mean < 1) then np = 0
ELSE IF (mean ≤ 1.5) then np = 2
ELSE IF (mean ≤ 4) then np = 3
ELSE IF (mean ≤ 8) then np = 4
ELSE IF (variance > 100 AND mean ≤ 20) then np = 5
ELSE np = 3

The intention behind this arrangement is to detect the general degree of variation within the input data so that a value of n can be selected which is likely to give a predominance of matches, but without wasting unnecessary data on difference data.
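The selection rules can be transcribed into a short Python sketch. The function name is hypothetical, and the sketch assumes that the comparisons on 1.5 and 20 are "less than or equal to" comparisons, like those on 4 and 8. The initialisation of np to the current value of n is omitted because every branch of the rules assigns a value.

```python
def select_np(mean, variance):
    """Return the new parameter np from the block statistics.

    The branches follow the rules in the text, in the same order.
    """
    if variance < 1 or mean < 1:
        return 0
    if mean <= 1.5:
        return 2
    if mean <= 4:
        return 3
    if mean <= 8:
        return 4
    if variance > 100 and mean <= 20:
        return 5
    return 3
```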

To illustrate the operation of the encoder of Figure 8, example mean and variance values for the absolute differences between adjacent data words are as follows:

Mean = 3.2
Variance = 10.3

From the thresholds in the above scheme, it can be seen that np = 3.

The arrangement described above therefore provides an adaptive encoder in which a value of n can be selected, on a block-by-block basis, so as to be appropriate to the input data in use. It is noted that the process does not analyse the success of the compression process carried out by the DZIP (n) encoder, but instead derives the appropriate value of n by analysis of the input data itself. If the optional delay unit 850 is used, then that value of n can be applied to the actual data from which it was derived. If the delay unit is not used, there is a one or more block lag between the generation and use of the appropriate value of n, but this need not be a problem, especially where the block size is set to be much lower than the typical rate of variation within data files. Also, not including the delay unit can reduce the overall processing delay through the system.

The value of n which is used in the encoding of data by the adaptive encoder described above needs to be added to the data stream so that information can be extracted by the decoder for use in decoding the compressed data. The value of n is encoded into the data stream using so-called marker codes defined by the ALDC standard.

In ALDC there are 16 marker codes, 1 1111 1111 0000 to 1 1111 1111 1111. The last code in this range signifies an end of file (EOF). The others are classified in the standard as reserved, but are used here as follows to signify a new value of the parameter n:

1 1111 1111 0000 signifies n=0
1 1111 1111 0001 signifies n=1
and so on.

This allows values of n up to 15 to be encoded directly. Higher values of n can be encoded by concatenating two marker codes, so that a first marker code indicates a change of numerical range (e.g. to 15-30) for subsequent marker codes.

Adaptive Decoders

Figure 10 schematically illustrates an adaptive SZIP or DZIP decoder complementary to the respective encoder of Figure 6 or Figure 8.

Figure 10 shows an "n" value detector 900 and an SZIP (n) or DZIP (n) decoder 910. The decoder 910 may be a decoder as shown in Figure 3 or Figure 5.

Operation of this adaptive decoder arrangement is straightforward, in that the "n" value detector 900 detects encoded values of n from the compressed data stream by searching for appropriate codes such as those defined above. When a new value of n is detected, this is passed to the decoder 910 for use in decoding subsequent compressed data. Parts of the compressed data stream which do not correspond to the encoding of a new value of n are passed directly by the "n" value detector 900 to the decoder 910 for decoding.
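The role of the "n" value detector 900 can be illustrated with a short Python sketch. It assumes, purely for illustration, that the compressed stream has already been split into individual codes held as strings of '0' and '1' characters, that a 13-bit code beginning with nine ones is a marker code, and that the end-of-file code simply ends processing. The names used are hypothetical, and a real detector would operate on the bit stream itself.

```python
MARKER_PREFIX = "1" * 9  # leading "1 1111 1111" shared by all 16 marker codes
EOF_SUFFIX = "1111"      # the last marker code signifies end of file


def route_codes(codes, initial_n=0):
    """Yield (n, code) for every code that should reach the decoder.

    Marker codes update n and are not passed on; the end-of-file
    marker stops the iteration (an assumption of this sketch).
    """
    n = initial_n
    for code in codes:
        if len(code) == 13 and code.startswith(MARKER_PREFIX):
            suffix = code[9:]
            if suffix == EOF_SUFFIX:
                return
            n = int(suffix, 2)  # new value of n for subsequent codes
            continue
        yield n, code
```

Each code that is passed on is paired with the value of n in force at that point, which is the information the decoder 910 needs in order to decode it.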

The skilled man will appreciate that the embodiments in this description may be implemented as hardware, programmable or custom hardware such as an ASIC or FPGA, a mixture of hardware and software or purely by software running on a known data processing apparatus. Where the implementation may involve software, it will be appreciated that the software and a storage medium holding some or all of that software are also considered to be embodiments of the present invention.


Claims

1. Data compression apparatus for compressing input m-bit data words into an output compressed data stream, the apparatus comprising:

a data memory for storing at least (m-n) predetermined bit positions of a plurality of most-recently-received input data words, where n is an integer defined by 1 ≤ n < m;

comparing logic for comparing the (m-n) predetermined bit positions of each input data word with corresponding bit positions of data words stored in the data memory to detect whether a match exists;

a detector, responsive to the comparing logic, for detecting whether there exists a match in respect of (m-n) predetermined bit positions of z or more consecutive input data words and corresponding consecutively received entries in the data memory, where z is an integer greater than 1; and

output logic operable to generate compressed output data, where:

(a) if the detector detects a match of the (m-n) predetermined bit positions of z or more consecutive input words and corresponding consecutively received entries in the data memory, the compressed output data comprises data indicating the positions of the matching entries in the data memory; and

(b) if the detector detects that a current input data word does not form part of a match of the (m-n) predetermined bit positions of z or more consecutive input words and corresponding consecutively received entries in the data memory, the output compressed data comprises data defining the m bits of the current input data word.

2. Apparatus according to claim 1, in which, in condition (a), the compressed output data comprises data defining the n LSBs for each of the input words.

3. Apparatus according to claim 2, in which the data defining the n LSBs for each of the input words in a matching sequence comprises those n bits of each input word.

4. Apparatus according to any one of claims 1 to 3, in which z = 2.


5. Apparatus according to any one of the preceding claims, in which the data memory is arranged to store all m bits of the plurality of most-recently-received input data words.

6. Apparatus according to any one of the preceding claims, in which:

the data memory is arranged to store the plurality of most-recently-received input data words in an ordered sequence dependent upon the order of processing of the data words; and

the data generated by the output logic indicating the positions of the matching entries in the data memory comprises data defining a position and an extent of the sequence of matching entries in the ordered sequence.

7. Apparatus according to any one of the preceding claims, in which the data defining the m bits of the current input data word comprises those m bits of that data word.

8. Apparatus according to any one of the preceding claims, in which the (m-n) predetermined bit positions are (m-n) MSBs.

9. A method of data compression for compressing input m-bit data words into an output compressed data stream, the method comprising the steps of:

storing at least (m-n) predetermined bit positions of a plurality of most-recently-received input data words, where n is an integer defined by 1 ≤ n < m;

comparing the (m-n) predetermined bit positions of each input data word with corresponding bit positions of data words stored in the data memory to detect whether a match exists;

detecting whether there exists a match in respect of the (m-n) predetermined bit positions of z or more consecutive input data words and corresponding consecutively received entries in the data memory, where z is an integer greater than 1; and

generating compressed output data, where:

(a) if the detecting step detects a match of the (m-n) predetermined bit positions of z or more consecutive input words and corresponding consecutively received entries in the data memory, the compressed output data comprises data indicating the positions of the matching entries in the data memory; and

(b) if the detecting step detects that a current input data word does not form part of a match of the (m-n) predetermined bit positions of z or more consecutive input words and corresponding consecutively received entries in the data memory, the output compressed data comprises data defining the m bits of the current input data word.

10. Data decompression apparatus for decompressing an input compressed data stream into output m-bit data words, the apparatus comprising:

a data memory for storing at least (m-n) predetermined bit positions of a plurality of most-recently-decompressed output data words, where n is an integer defined by 1 ≤ n < m;

logic for detecting whether a currently received data portion of the input compressed data stream represents either:

(a) data indicating the positions of matching entries in an ordered arrangement of previously decompressed data words and data defining n bits for each of those data words, where n is an integer defined by 1 ≤ n < m; or

(b) data defining the m bits of a data word; and

an output circuit operable either:

in the case of the data portion being of type (a), to retrieve the matching entries from the data memory and to combine the (m-n) predetermined bit positions of each matching entry with a corresponding set of n bits from the data portion, to produce respective m-bit output data words; or

in the case of the data portion being of type (b), to output the m bits defined by the data portion to form an m-bit output data word.

11. A method of data decompression for decompressing an input compressed data stream into output m-bit data words, the method comprising the steps of:

storing in a data memory at least (m-n) predetermined bit positions of a plurality of most-recently-decompressed output data words, where n is an integer defined by 1 ≤ n < m;

detecting whether a currently received data portion of the input compressed data stream represents either:

(a) data indicating the positions of matching entries in an ordered arrangement of previously decompressed data words and data defining n bits for each of those data words, where n is an integer defined by 1 ≤ n < m; or

(b) data defining the m bits of a data word; and

generating output data words by either:

in the case of the data portion being of type (a), retrieving the matching entries from the data memory and combining the (m-n) predetermined bit positions of each matching entry with a corresponding set of n bits from the data portion, to produce respective m-bit output data words; or

in the case of the data portion being of type (b), outputting the m bits defined by the data portion to form an m-bit output data word.

12. A compressed data stream comprising successive data portions each being either:

(a) data indicating the positions of matching entries in an ordered arrangement of previously decoded data words and data defining n bits for each of the input words, where n is an integer defined by 1 ≤ n < m; or

(b) data defining the m bits of a data word.

13. A storage medium carrying a compressed data stream according to claim 12.

14. Data compression apparatus substantially as hereinbefore described with reference to the accompanying drawings.

15. A method of data compression, the method being substantially as hereinbefore described with reference to the accompanying drawings.

16. Data decompression apparatus substantially as hereinbefore described with reference to the accompanying drawings.

17. A method of data decompression substantially as hereinbefore described with reference to the accompanying drawings.


18. A computer program comprising program code for carrying out a method according to claim 9, claim 11, claim 15 or claim 17.

19. A carrier medium carrying a computer program according to claim 18.

20. A medium according to claim 19, the medium being a storage medium.