GB2193866A

GB2193866A - Data compression method and apparatus

Info

Publication number: GB2193866A
Application number: GB08717349A
Authority: GB
Inventors: Dr James Hundith Williamson
Original assignee: SERIF SOFTWARE Ltd
Current assignee: SERIF SOFTWARE Ltd
Priority date: 1986-07-24
Filing date: 1987-07-22
Publication date: 1988-02-17
Also published as: GB8717349D0; GB8618093D0

Abstract

A data compression system automatically assigns identifying tokens to the data as it is processed, and uses these tokens to reference later parts of the data. The tokens represent the combinable pairs of data units and may themselves be combined. Thus each repeated sequence of data units requires half the number of tokens to represent it as compared with its previous appearance. The set of stored tokens is built up during encoding, and is correspondingly available for decoding. The encoding process can be adjusted for maximum compression or for faster throughput, in which case the partially compressed files can be processed later to compress them further. The decoding stage needs no prior knowledge either of the possible content of the data or of the tuning parameter employed during the compression. When applied to data sequences containing a high degree of redundancy, e.g. in image compression, the invention can give a logarithmic compression factor for significant sections of the message.

Description

SPECIFICATION

Data compression method and apparatus

This invention relates to a method and apparatus for compressing sequences of characters or other digital data for minimising storage requirements or reducing transmission times.

Compression of data is a common requirement in the fields of data storage, processing and transmission for reasons of economy or efficiency. Methods such as Huffman coding work by assigning short codes to commonly occurring characters and longer codes to rarer ones, but are only helpful if the statistical distribution of characters in the file is what it was assumed to be. Methods which work by encoding common words such as "the" are even more liable to be fooled as, say, the file may contain French instead of English.

For that reason, methods have been devised whereby statistical distributions are built up during processing of the text to allow predictions to be made at any point and thus to recode the information in the light of those predictions, such as disclosed in U.K. Patent Application No. 8502324 in the name Codex Corporation.

A universal data compression system is described by Ziv, J. and Lempel, A., in "A universal algorithm for sequential data compression" in IEEE Transactions on Information Theory, volume IT-23, No. 3, May 1977. This system is based on an incremental parsing algorithm in which future segments of the source output are encoded via maximum length copying from a buffer containing the recent past output. The transmitted code word consists of the buffer address and the length of the copied segment. The parsing algorithm parses a source string into a collection of segments of gradually increasing length, based on the rule that, starting with an empty segment, each new segment added to the collection is one symbol longer than the longest match so far found. For example, the string 010100010 gets parsed as the collection (0,1,01,00,010).

When the parsed segments are retained in the same order as they are received, each segment can be encoded as the ordered pair (i,y), where the index i, written as a binary integer, gives the position of the longest earlier found matching segment in the collection, and y gives the last added symbol. For example, the code of the segment "010" is, conceptually, the pair (3,0). Although the Ziv-Lempel algorithm generally works well, the complexity of the implementation in terms of the number of stored items needed to generate the parsing trees grows beyond all bounds. Moreover, this technique requires that the length of the copied segment be encoded and this requires further encoding information and limits the amount of data compression.
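The incremental parsing rule and the (i,y) pair encoding described above can be sketched as follows. This is a minimal illustration only: the function name `incremental_parse` and the convention that index 0 denotes the empty segment are assumptions made for the example, not part of the cited paper or of this specification.

```python
def incremental_parse(source):
    """Split `source` into segments, each one symbol longer than its longest earlier match.

    Returns (segments, pairs, leftover), where each pair is (i, y): i is the
    1-based position of the longest earlier matching segment (0 = empty segment)
    and y is the last added symbol. `leftover` is any unfinished trailing match.
    """
    position = {"": 0}        # segment text -> position in the collection
    segments, pairs = [], []
    current = ""              # longest match found so far for the segment being built
    for symbol in source:
        if current + symbol in position:
            current += symbol             # still matches an earlier segment: keep extending
        else:
            pairs.append((position[current], symbol))
            segments.append(current + symbol)
            position[current + symbol] = len(position)
            current = ""                  # start the next segment from empty
    return segments, pairs, current


segments, pairs, leftover = incremental_parse("010100010")
print(segments)   # ['0', '1', '01', '00', '010']
print(pairs[-1])  # (3, '0')
```

The final pair shows the segment "010" encoded as a reference to segment 3 ("01") followed by the symbol 0, as in the example in the text.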

An object of the present invention is to provide a data compression system of substantially universal applicability avoiding the complexity and slowness of the prediction method, and limiting the amount of encoding required for efficient data compression.

This is achieved by automatically assigning identifying tokens to each piece of the data as it is processed, and using these tokens to reference later parts of the data. The encoding process can be adjusted for maximum compression or for faster throughput, in which case the partially compressed files can be processed later to compress them further. The decoding stage needs no prior knowledge either of the possible content of the data or of the tuning parameter employed during the compression.

The compression ratio is usually much better than that achieved by the Ziv-Lempel method, but in the worst possible case, the same performance is obtained. When applied to data sequences containing a high degree of redundancy, e.g. in image compression, the invention can give a logarithmic compression factor for significant sections of the message.

Accordingly, in one aspect of the invention there is provided a method of compressing data comprising assigning an identifying code to each piece of data as it is processed, and using the identifying codes to reference later parts of data.

Conveniently, the data is digital data.

Preferably, said method is adjusted to vary the compression of data. Conveniently, the method is adjusted between one rate to provide maximum data compression and another rate to provide maximum data throughput, in which case the data is only partly compressed. Conveniently, said partially compressed data is further processed to provide fully compressed data or further data compression.

Accordingly, in another aspect of the invention there is provided a method of encoding a stream of source characters into code words, said method comprising the steps of: passing said stream of source characters through a data compressor, creating a code for each successive combinable pair of source characters and storing said created code in a memory, comparing said stored code with subsequent pairs and generating the same code of each corresponding source characters, repeating the above steps until the stream of source characters has been encoded.

Preferably said code words are decoded by using the reconstituted source characters of the earlier part of the stream as a dictionary to decode the later part.

Accordingly, in another aspect of the invention there is provided apparatus for compressing data comprising, compression means for compressing a data input by assigning an identifying code to each piece of data, first storage means coupled to said compression means for storing identifying codes corresponding to said compressed data, and an output encoder coupled to said compression means and to said first storage means for operating on a single token at a time, said output encoder sensing the quantity of data and encoding the data in accordance with a predetermined algorithm.

Preferably, a decoder is provided to decode information from the compressed data, said decoder comprising: an input decoder for expanding data into its tokenised form, second storage means coupled to said input decoder for storing an array of identifying codes as in said first storage means, and decompressor means coupled to said second storage means and to said input decoder for expanding the tokenised data into its original form.

In another aspect of the invention there is provided a data compressor comprising a compressor for compressing input data to a plurality of tokens representing combinations of input data, and a token store for storing partially and fully compressed data, said compressor and token store being adapted to be connected to an output encoder.

A method of compressing data as hereinbefore defined including the step of providing a search by means of linked lists for the required token. Alternatively the search can be provided by a direct sequential search. Conveniently, this is achieved with the addition of the output encoder and input decoder as described.

Preferably, the method also permits the use of alternative algorithms for output encoding and input decoding of the tokens which are obtainable from those described by linear transformations.

Optionally the compression store is primed by processing a selected known file before the main file is processed and the decompression token store is arranged to be primed with the same known file.

In another aspect of the invention there is provided a method of recognising a primary non-compound token by testing that its value is less than a unique end-of-file token. Accordingly, the initial primary of the RIGHT array is omitted.

In another aspect of the invention, the compressor is arranged, in physical terms, to travel along the token store instead of being constructed as a separate unit. This increases the effective working speed of the apparatus in that tokens exhausted by the compressor are already present in their intended destination in the token store. Moreover, this arrangement enables one of the auxiliary stores to be dispensed with because the locations used for right hand tokens in and above the compressor can be reused beneath it to perform the function of the digraph store. Moreover, the number of simple data flow operations within the compressor is minimised because, at certain stages of operation, the compressor rises automatically instead of the data having to be forced along it.

In another aspect of the invention, the adjustment of the compressor is arranged to be varied automatically so that when highly compressible data is being processed, the stroke of the piston increases so as to compress the tenuous input data stream as early as possible, and when the data is more full of information, the stroke is reduced so as to pass the irreducible data through the system as quickly as possible.

These and other aspects of the invention will become apparent from the following description when taken in combination with the accompanying drawings:

Figure 1 is a schematic block diagram of a data compressor in accordance with an embodiment of the invention;

Figure 2 is a diagrammatic and enlarged view of the primary compressor shown in Fig. 1;

Figure 3 is a diagrammatic view of the token store depicted in Fig. 1;

Figure 4 is a schematic block diagram of a decompression circuit in accordance with an embodiment of the invention;

Figure 5 is a diagrammatic representation of the steps required in the method of compressing input data in accordance with an embodiment of the invention.

The encoding apparatus diagrammatically shown in Fig. 1 consists of a link 11 to the data store 10 containing the original files, a primary compressor 12 which at any time contains uncompressed and partially compressed data as described more fully in Fig. 2, a token store 13 of fully compressed data which is re-read as well as written by the compressor 12, a final output encoder 14 and a link 15 to the data store 16 containing the resultant files. Dependent upon the application, the link 15 may be a telecommunications link to a distant receiver, or a direct connection to a local storage device. The store 16 may be distinct from store 10, or it can be the same physical store, in which case it is likely that each original file will be destroyed or deleted as soon as its compressed version is created.

The primary compressor, shown in Fig. 2, consists of an array of storage locations of equal capacity, each able to hold a token of maximum size, as described below. Data enters at port 20, one unit at a time, the unit being typically 1 byte (8 binary digits) or 1 nibble (4 binary digits) or a single bit (1 binary digit) or alternatively it may be a token produced by a previous rapid but incomplete compression. The length of the array is a major consideration in tuning the performance of the device to the current needs of the user, and for that reason the position of port 20 can be varied along the physical length of the array.

Within the compressor is a variable length piston 21 whose function is to combine the pair of data units at its input face 22 and place the combined unit at its output face 23. There is also a diaphragm 24 which is moved by the piston under circumstances described below. Finally there is the output port 25 from which data is vented at the appropriate time in the cycle.

The store of Fig. 3 is of simpler construction, consisting of a pair of arrays 30, 31 of N storage locations, labelled 0 to N-1, each of which can hold a token whose value itself lies in the range 0 to N or -1 to N-1. There are also two auxiliary arrays 32, 33 of equal size. The primary compressor is capable of direct access to any location of all these arrays to perform reading or writing operations.

The output encoder 14 operates on only a single token at a time, but it also senses the quantity of data currently held in store 13 and uses this information in performing its function.

As best seen in Fig. 4, the decoding apparatus consists of a link 45 from the data store 46 containing the compressed file, an input decoder 44 which expands the data into its tokenised form, a store 43 consisting of a pair of arrays identical to that shown as 30, 31, a decompressor 42 and a link 41 to the final data store 40. This apparatus is simpler than that of Fig. 1 as there is no analogue to the components 32, 33 and the decompressor 42 is much simpler than the compressor 12.

The efficiency of the decoding apparatus is independent of the tuning of the encoding apparatus and it is not necessary to furnish the decoding apparatus with this information. Indeed if an uncompressed file were processed by the decoding apparatus, it would pass through unaltered.

The apparatus may be constructed in one embodiment of standard data storage registers and arithmetic-logical units configured together for the purpose, or may be implemented in a conventional digital computer, either of single tasking or multi-tasking construction. For example, it has been implemented on a Macintosh (Apple Inc) microcomputer both in a high-level language, BCPL, and in 68000 Assembler. It has also been implemented on a Nova microcomputer (Data General) in BCPL and in machine code, and on the Data General MV Series in BCPL.

The method by which the apparatus works is described with reference to Fig. 5 which shows successive stages of compression of a short sentence by way of example. The important feature to bear in mind is that, as soon as a pair of data are combined, whether they are primary data units, or already defined compound tokens, their combination is immediately available as a token in its own right to combine with further tokens. This means that the compressor "learns" as it goes along and the sum total of its knowledge at any stage is a vocabulary which is exactly that available to the decompressor at the corresponding stage of decompression, hence the process can proceed safely and efficiently. Furthermore, every compound token is composed only of earlier tokens in the vocabulary which enables the encoder 14 and decoder 44 to employ a particularly efficient algorithm to perform their functions.

The operations which are carried out on and by the compressor 12 are as follows:

(i) Initial load: charge the compressor fully with raw data from level 0 to Z, and set both the diaphragm and piston at the bottom Y=X=W=0 (Fig. 2).

(ii) Combination: if the data at Y and Y+1 can be combined (e.g. T and H in THE) into an already defined token, then do so placing that token at X and raising the position of the piston by one X=X+1 and also increasing its length by one Y=Y+2. Should the diaphragm have been in contact with the piston (W=X) at the beginning of this step, then move the diaphragm down one step W=W-1 unless it is already at the output port (W=0).

(iii) Pass-through: if the data at Y and Y+1 cannot be combined then let the data at Y fall through the piston to reach position X and raise the piston above it X=X+1 and Y=Y+1, and if the diaphragm is touching the piston (W=X), move it as well W=W+1. As a special case of this operation, we include the zero length piston (X=Y). The pass-through operation is applied automatically should only one datum be left above the piston (Y+1=Z).

(iv) Recharge: whenever the piston reaches the top of the compressor (Y=Z), it reverts to zero length Y=X and raw data is fed into the areas from Y to Z. Should the external supply of data become exhausted, then a special end-of-file datum is entered as often as necessary.

(v) Reset: if the lower surface of the piston reaches the top of the compressor (X=Y=Z), then the piston immediately is reset back to the position of the diaphragm Y=X=W.

(vi) Exhaust: when the diaphragm reaches the top of the compressor, the data is now fully compressed so the pair of data at 0 and 1 are exhausted and pass to the output encoder and to the store. The rest of the data settles down two steps and then the compressor is recharged. The piston and diaphragm are then repositioned as for the initial load Y=X=W=0.

(vii) Close down: as soon as an end-of-file datum has been exhausted, the operation of the compressor ceases.

A more rapid, but less complete, compression is available to the user by attaching the diaphragm 24 directly to the under-face 23 of the piston. In this case, step (v) is omitted and step (ii) is replaced as follows:

(iia) Combination: if the data at Y and Y+1 can be combined into an already defined token, then do so and place that token at Y+1. If the lower face of the piston is at the output port (X=0), then raise the upper face of the piston Y=Y+1. If, however, the piston is already partially raised (X>0), then transfer the token at X-1 to just above the piston at position Y and move the lower face down one step X=X-1.

Referring to Fig. 3, in the token store 13 the positions 0 to L-1 represent primary data units, e.g. 0 to 9 if the input data stream contained only decimal digits, 0 to 255 if it contained 8 bit ASCII or EBCDIC characters and so on. Position L is used to represent the unique end-of-file token. Each position in the LEFT array 30 up to this level is filled with its own place value 0 to L, and each position in the RIGHT array 31 is cleared by filling with a distinctive value, say -1 or N. The auxiliary arrays DIGRAPH and LINK are also cleared up to the same position L. The level indicator M initially is set at L. The operations performed are:

(viii) Define token: Whenever two data are exhausted by the compressor, then, as long as the store is not yet full (M<N-1), M is incremented M=M+1, the first datum J is stored at position M in LEFT and the second K at M in RIGHT. This process defines token M as consisting of a left hand subtoken J and a right hand subtoken K.

(ix) Link token: In order to provide rapid access for subsequent operations to the token store, the content of the DIGRAPH array at position J is checked. If it is currently clear, then M is written there to denote that the token M is obtained as a digraph consisting of token J combined with another token. If however the digraph of J already exists, then it is necessary to look at that token and check its LINK field. Should that token already have a link field, then the apparatus inspects its link, and if necessary that token's link in turn, and so on until at last a vacant link is found. The value M is written there to denote that token M is not the first but is a subsequent digraph of token J defined following all the other digraphs of J already established. (It should be noted that the mode of operation of the compressor ensures that the store can never be called upon to define an already defined token). Both the LINK and DIGRAPH fields at M are cleared.

(x) Use token: In order for the piston to be able to condense two data J and K together, it must find them already defined as a compound token in the store. First of all it goes to position J and thence to its DIGRAPH if any. If there is no digraph then the store reports back that the requested token does not yet exist. If there is a digraph I, then the RIGHT value of I is compared with K. If they agree then the store reports back that the required token is I, otherwise the LINK from I is checked in the same way. This process continues until either the token is found or an empty link field indicates that there is no such token.

Should the store become full, then 'use token' can be performed in the normal way, but all calls to 'define token' are simply ignored.
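The store operations (viii) to (x), together with the full-store rule, can be sketched as below. This is an illustrative sketch rather than the apparatus itself: the class name `TokenStore`, the method names and the use of Python lists for the LEFT, RIGHT, DIGRAPH and LINK arrays are assumptions made for the example, and -1 is taken as the distinctive "clear" value.

```python
CLEAR = -1  # distinctive value marking an empty RIGHT, DIGRAPH or LINK field


class TokenStore:
    def __init__(self, L, N):
        # Positions 0..L-1 are primary data units, position L is end-of-file.
        self.N = N
        self.left = list(range(L + 1)) + [CLEAR] * (N - L - 1)
        self.right = [CLEAR] * N
        self.digraph = [CLEAR] * N   # first compound token whose left part is this token
        self.link = [CLEAR] * N      # next compound token sharing the same left part
        self.M = L                   # level indicator

    def define_token(self, j, k):
        """Operations (viii) and (ix): define token M = (j, k) and link it in."""
        if self.M >= self.N - 1:
            return None              # store full: the call is simply ignored
        self.M += 1
        m = self.M
        self.left[m], self.right[m] = j, k
        self.digraph[m] = self.link[m] = CLEAR
        if self.digraph[j] == CLEAR:
            self.digraph[j] = m      # first digraph of j
        else:
            i = self.digraph[j]
            while self.link[i] != CLEAR:
                i = self.link[i]     # walk the chain to the first vacant link
            self.link[i] = m
        return m

    def use_token(self, j, k):
        """Operation (x): return the token defined as (j, k), or None if there is none."""
        i = self.digraph[j]
        while i != CLEAR:
            if self.right[i] == k:
                return i
            i = self.link[i]
        return None


store = TokenStore(L=256, N=4096)
th = store.define_token(ord("T"), ord("H"))   # token 257 = (T, H)
the = store.define_token(th, ord("E"))        # token 258 = (257, E)
print(store.use_token(ord("T"), ord("H")), store.use_token(th, ord("E")))  # 257 258
```

Each search touches only the digraphs of J, so the cost of 'use token' depends on how many compound tokens share that left hand subtoken rather than on the size of the store.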

At the same time as feeding the token store, the exhausted data are sent to the output encoder 14. In fact, the main compression has already been achieved, and should the user wish for the quickest possible operation even at the expense of not quite so effective compression, then it is possible to bypass this part of the apparatus.

At any stage, the output encoder knows the current level M in the store and hence knows that the tokens that it receives are always in the range 0 to M-1. Let P=2^k be the highest power of 2 which does not exceed M-1. Then tokens in the range 0 to 2P-M-1 can be coded in k bits, while those from 2P-M to M-1 will require k+1 bits. A very convenient representation for token Q is as follows:

if 0 ≤ Q ≤ 2P-M-1, transmit Q;

if 2P-M ≤ Q ≤ P-1, transmit Q, then a zero bit;

if P < Q ≤ M-1, transmit 2P-Q, then a one bit.
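The variable-length representation can be sketched as follows. This is a hedged illustration: the function names are hypothetical and bits are shown as a text string of '0' and '1'. As written in the text, the third case leaves Q=P without a value that fits in k bits, so the sketch assumes the reflection 2P-1-Q over the range P ≤ Q ≤ M-1; that choice is an assumption of this example, not a statement of the original rule.

```python
def to_bits(value, width):
    """Return `value` as a string of exactly `width` binary digits (empty if width is 0)."""
    return "".join(str((value >> i) & 1) for i in range(width - 1, -1, -1))


def encode_token(Q, M):
    """Encode token Q, known to lie in 0..M-1, in k or k+1 bits."""
    assert M >= 2 and 0 <= Q < M
    k = (M - 1).bit_length() - 1   # P = 2**k is the highest power of 2 not exceeding M-1
    P = 1 << k
    if Q < 2 * P - M:
        return to_bits(Q, k)                    # short form: k bits
    if Q < P:
        return to_bits(Q, k) + "0"              # k bits, then a zero bit
    return to_bits(2 * P - 1 - Q, k) + "1"      # assumed reflection, then a one bit


for Q in range(6):
    print(Q, encode_token(Q, 6))
```

With M=6 (so k=2 and P=4) tokens 0 and 1 take two bits and tokens 2 to 5 take three, giving the codes 00, 01, 100, 110, 111 and 101.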

On the assumption that all tokens in the range 0 to M-1 are equally likely, the expectation value of the number of bits needed to transmit a token is k+2-2P/M, which averaged over the octave P < M ≤ 2P-1 gives k+2-2 log_e 2 or approximately k+0.614. If on the other hand, one were to assume that the probability of a token being found falls off linearly in the range 0 to M-1, then the average is found to be k+6-8 log_e 2 which is about k+0.45.
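The two averages can be checked under a continuous approximation, writing M for the store level and treating it as uniformly distributed over the octave from P to 2P. The derivation below is a sketch under that assumption; the linear fall-off case additionally approximates the sums over tokens by integrals.

```latex
% Equally likely tokens: 2P-M tokens use k bits, the other 2M-2P use k+1 bits.
\mathbb{E}[\text{bits}]
  = \frac{k\,(2P-M) + (k+1)\,(2M-2P)}{M}
  = k + 2 - \frac{2P}{M}

% Average of the excess over the octave P \le M \le 2P:
\frac{1}{P}\int_{P}^{2P}\left(2 - \frac{2P}{M}\right)\mathrm{d}M
  = 2 - 2\ln 2 \approx 0.614

% Linear fall-off: token Q has weight proportional to M-Q, so the probability
% of needing the extra bit is approximately (2M-2P)^2 / M^2 = (2 - 2P/M)^2.
\frac{1}{P}\int_{P}^{2P}\left(2 - \frac{2P}{M}\right)^{2}\mathrm{d}M
  = 4 - 8\ln 2 + 2
  = 6 - 8\ln 2 \approx 0.455
```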

It should be emphasised that the user of this apparatus who has some prior knowledge of the type of message or is willing to add further apparatus to post-compress the data, may achieve even better overall compression.

This could enhance the performance of the current invention in special circumstances, but such modifications are peripheral to the apparatus and method described here, which provide a universal technique for data compression.

Another optional feature would be to add check sums, cyclic redundancy checks or the like to the output data stream. This again is entirely at the discretion of the user and does not affect the novelty or effectiveness of this invention.

The primary decompressor 42 considers each token which it receives, separately. It takes the token value as a position in the store and reads the LEFT and RIGHT values there. First of all the right hand subtoken is inspected: if it has the illegal distinctive value, then the subdivision of the subtoken has reached its conclusion and the left hand value is passed; if however there is a valid right hand value then the left hand value has to be further subdivided. Once a left hand has actually been passed, the decompressor can turn its attention to the most recently put aside right hand value until finally all fragments of the original token have been generated. Alternatively, the left hand subtoken is inspected first and if it is less in value than the unique end-of-file token, then subdivision has reached its conclusion.

Immediately the distinctive end-of-file token is received (and it can never be part of a compound token) the process of regenerating the original file is complete.
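The subdivision performed by the decompressor can be sketched with an explicit stack of right hand values that have been put aside. The function name `expand` and the use of -1 as the distinctive value are assumptions for this example; the LEFT and RIGHT arrays are built by hand here purely to have something to expand.

```python
CLEAR = -1  # assumed distinctive value in RIGHT marking a primary token


def expand(token, left, right):
    """Return the primary data units represented by `token`, in original order."""
    output = []
    pending = [token]                 # tokens still to be subdivided
    while pending:
        t = pending.pop()
        if right[t] == CLEAR:
            output.append(left[t])    # primary token: pass its left hand value
        else:
            pending.append(right[t])  # put the right hand subtoken aside
            pending.append(left[t])   # and subdivide the left hand subtoken first
    return output


L = 256
left = list(range(L + 1))             # each primary position holds its own place value
right = [CLEAR] * (L + 1)
left.append(ord("T")); right.append(ord("H"))   # token 257 = (T, H)
left.append(257);      right.append(ord("E"))   # token 258 = (257, E)

print(bytes(expand(258, left, right)))  # b'THE'
```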

The input decoder 44 works in the inverse way to the output encoder. It likewise senses the current store level M and the associated values P and k. It receives k bits to yield a signal R which it decodes to give a token as follows:

if 0 ≤ R ≤ 2P-M-1, pass on R;

otherwise read 1 more bit: if this is a zero, pass on R; if it is a one, pass on 2P-R.
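The decoding rule can be sketched as below, reading bits from a text string of '0' and '1'. The function name and the (token, next position) return value are hypothetical. As with the encoder, the sketch assumes that the reflected value is 2P-1-R, so that every token from P to M-1 has a k-bit value; this is an assumption of the example rather than the rule as printed.

```python
def decode_token(bits, pos, M):
    """Decode one token from `bits` starting at index `pos`; return (token, new_pos)."""
    k = (M - 1).bit_length() - 1      # P = 2**k is the highest power of 2 not exceeding M-1
    P = 1 << k
    R = 0
    for _ in range(k):                # receive k bits to yield the signal R
        R = (R << 1) | int(bits[pos])
        pos += 1
    if R < 2 * P - M:
        return R, pos                 # short form: no further bit is needed
    extra = bits[pos]                 # otherwise read one more bit
    pos += 1
    if extra == "0":
        return R, pos
    return 2 * P - 1 - R, pos         # assumed reflection for the upper range


stream = "00" + "110" + "111"         # three tokens coded at store level M=6
pos, tokens = 0, []
while pos < len(stream):
    token, pos = decode_token(stream, pos, 6)
    tokens.append(token)
print(tokens)  # [0, 3, 4]
```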

As each pair of tokens is sent to the decompressor 42, it is also added to the store 43 (unless it is already full). The level indicator M is incremented and the tokens are stored at position M in the arrays LEFT and RIGHT just as in the compressing apparatus. However, here that is all that is done because there are no auxiliary arrays.

The apparatus and method will work well with any data file which has structure or is concerned with a topic, so that it has redundancy or holds repeated instances of the same subsequence of data. Thus in this document the word "the" is common, as in all English texts, but so is the word "data". Only if the file were full of random noise would the compression fail to work. In a trial of the apparatus and method on a one and a half megabyte spreadsheet file, data has been compressed by about 75%, i.e. from 1.5Mb down to 400k bytes, and reconstituted to its original volume without error. In a further trial data has been transmitted on a line physically capable of 9600 baud but at an effective data transmission rate of 30,000 baud, i.e. an increase of a factor of about 3.

The system learns very rapidly and every repetition of a subsequence is compressed into half the number of tokens that it took last time it was found. Even if a familiar phrase appears in the midst of a new strange context, it will be instantly recognised and compressed. Should there be runs of repeated data, for example, in image data, then the compression rapidly becomes logarithmic for those sections, limited only by the capacity Z of the primary compressor to a compression ratio of 2^Z to 1.

In some methods of data compression, a selected dictionary of code words is passed as the first part of the file. There would be no advantage in doing this explicitly in the present invention as the optimum dictionary is actually the file itself. However, in some applications, it could be useful for both the compressing and decompressing stages to have access to a set of fixed files which can be used to prime their token stores. If the choice of priming file is passed as a simple reference number, then there will be an overall saving if one priming file is appropriate for several data files.

A similar method could be used when the system is being used to compress the data prior to transmission to a remote site and it is wished to encrypt the message. Because of the lack of any simple relationship between the lengths of a section of the data in its original and compressed forms, the system can be set up so as to distort any inherent frequency distribution of characters, words or whole phrases. A simple permutation operation may then be applied to each reference to an existing token so as to encrypt the message in a very effective way.

It will be understood that various modifications to the compression technique hereinbefore described may be implemented although the same decompression technique and outputting arrangement would be used, for example, in the aspect of compression in which the cycle is varied, this may be realised in a number of different ways each of which may be considered as giving a different choice between speed of operation and degree of compression. Nevertheless, the principal embodiment described hereinbefore achieves the most thorough compression of any of the methods.

In the first of these variants, the length of the compressor is fixed at two data units. If the data can be combined, then this is done and another datum is fed in above the newly combined token. This process is repeated until the pair is not recognised as an already defined token. In the aspect of the invention in which the compressor rides up upon the data store, conveniently the pairs of data immediately constitute their own definition and the compressor rises to attempt to compress the next two data.
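One possible reading of this first variant, together with a matching decompressor, is sketched below. It assumes that an unrecognised pair is exhausted to the output and defined as a new token, that the output is a plain list of token numbers (the output encoder is bypassed), and that a Python dictionary stands in for the DIGRAPH and LINK search. The function names and the defaults L=256 and N=4096 are hypothetical.

```python
def compress_variant_one(data, L=256, N=4096):
    """Two-unit compressor: combine while the pair is known, else exhaust and define it."""
    EOF = L                          # position L is the unique end-of-file token
    defined = {}                     # (left, right) -> compound token
    M = L                            # level indicator
    out = []
    source = iter(data)

    def next_datum():
        # Once the input is exhausted, end-of-file is supplied as often as necessary.
        return next(source, EOF)

    while True:
        a, b = next_datum(), next_datum()
        while (a, b) in defined:     # recognised pair: combine and feed in another datum
            a, b = defined[(a, b)], next_datum()
        out += [a, b]                # exhaust the unrecognised pair
        if EOF in (a, b):
            return out               # close down once end-of-file has been exhausted
        if M < N - 1:                # 'define token' is ignored when the store is full
            M += 1
            defined[(a, b)] = M


def decompress(tokens, L=256, N=4096):
    """Rebuild the data, defining each received pair as a token just as the compressor did."""
    EOF = L
    left, right = list(range(L + 1)), [-1] * (L + 1)
    out = []
    for i in range(0, len(tokens), 2):
        pair = tokens[i:i + 2]
        for token in pair:
            if token == EOF:
                return bytes(out)
            pending = [token]
            while pending:           # subdivide until only primary tokens remain
                t = pending.pop()
                if right[t] == -1:
                    out.append(left[t])
                else:
                    pending += [right[t], left[t]]
        if len(left) < N:
            left.append(pair[0])
            right.append(pair[1])
    return bytes(out)


text = b"THE CAT SAT ON THE MAT. THE CAT SAT ON THE MAT."
assert decompress(compress_variant_one(text)) == text
print(compress_variant_one(b"ABAB"))  # [65, 66, 257, 256]
```

In the last line the second "AB" is sent as the single token 257, followed by the end-of-file token 256; the decompressor needs no knowledge of which variant produced the stream.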

In the second of these variants, the piston rises whenever a successful combination is made. Immediately a combination cannot be made, the lowest two tokens in the compressor (i.e. not necessarily those currently in the piston) are defined as a new token in a like manner to the first variant.

In the third of these variants, the action is similar, save that upon failure the piston falls back one place, and only when the piston is back at its lowest point of travel will a failure result directly in the exhausting of the two data in the form of a newly defined token.

In the fourth of these variants, the action is similar to that in the third, except that on any failure the piston immediately reverts to its lowest position where another attempt is made.

In the fifth of these variants, the piston rises upon success in a like manner, but the behaviour upon failure depends on the current position of the piston as follows: if the piston is at its lowest point it rises as if there had been a successful combination; if it is already one step above its lowest point then the underlying data are exhausted and defined; in any other position it falls back to its lowest point.

In the sixth of these variants, the action is similar to that in the fifth save that the exhaust may only take place if the failure at level one immediately follows a failure at the base level, otherwise the piston falls and another attempt is made.

In the seventh of these variants, unlike any of the others, the piston starts its cycle in the raised position so as to attempt to combine data items 2 and 3. Whether or not this succeeds it then falls to combine items 1 and 2, the current 2 being either the original 2 or the token representing the combination of 2 with 3 as the case may be. If the combination of the lowest data items succeeds then the whole cycle is repeated, but upon failure, the data is exhausted and the apparatus is recharged.

In summary, it will be understood that the present invention includes a number of different variants in different embodiments which differ mainly in the order in which the combination process is attempted on the data in the compressor. The relative performance of the variants depends on the exact content of the input data stream, but measurements have shown that the first variant is usually the most rapid but correspondingly the least complete in its compression. Variant three is somewhat slower but more complete. Variant four is usually a little slower still but is more complete again. Variant six is the slowest but almost always is most effective in compressing the data. The other variants have been found in most trials to be somewhat less successful than these four, but still in certain circumstances it is possible for them to give a more favourable performance.

Claims

1. A method of compressing data comprising assigning an identifying code to each piece of data as it is processed, and using the identifying codes to reference later parts of data.

2. A method as claimed in claim 1 wherein said data is digital data.

3. A method as claimed in claim 1 or claim 2 wherein the method is adjusted between one rate to provide maximum data compression and another rate to provide maximum data throughput, in which case the data is only partly compressed.

4. A method as claimed in any preceding claim wherein said partially compressed data is further processed to provide fully compressed data or further data compression.

5. A method of encoding a stream of source characters into code words, said method comprising the steps of: passing said stream of source characters through a data compressor; creating a code for each successive combinable pair of source characters and storing said created code in a memory; comparing said stored code with subsequent pairs and generating the same code of each corresponding source characters; repeating the above steps until the stream of source characters has been encoded.

6. A method as claimed in claim 5 wherein said code words are decoded by using the reconstituted source characters of the earlier part of the stream as a dictionary to decode the later part.

7. Apparatus for compressing data comprising, compression means for compressing a data input by assigning an identifying code to each piece of data, first storage means coupled to said compression means for storing identifying codes corresponding to said compressed data, and an output encoder coupled to said compression means and to said first storage means for operating on a single token at a time, said output encoder sensing the quantity of data and encoding the data in accordance with a predetermined algorithm.

8. Apparatus as claimed in claim 7 wherein a decoder is provided to decode information from the compressed data, said decoder comprising; an input decoder for expanding data into its tokenised form, second storage means coupled to said input decoder for storing an array of identifying codes as in said first storage means and decompressor means coupled to said second storage means and to said input decoder for expanding the tokenised data into its original form.

9. A data compressor comprising, a compressor for compressing input data to a plurality of tokens representing combinations of input data and a token store for storing partially and fully compressed data, said compressor and token store being adapted to be connected to an output encoder.

10. A method of compressing data as hereinbefore defined including the step of providing a search by means of linked lists for the required token.

11. A method as claimed in claim 10 wherein the search can be provided by a direct sequential search.

12. A method as claimed in claim 10 or 11 wherein this is achieved with the addition of the output encoder and input decoder as described.

13. A method as claimed in any one of claims 10 to 12 wherein the method also permits the use of alternative algorithms for output encoding and input decoding of the tokens which are obtainable from those described by linear transformations.

14. A method as claimed in any one of claims 10 to 13 wherein the compression store is primed by processing a selected known file before the main file is processed and the decompression token store is arranged to be primed with the same known file.

15. A method of recognising a primary non-compound token in a token store having a left array and a right array, said token store having 0 to L-1 positions representing primary data units, and position L representing a unique end-of-file token, each position in the left array being filled with its own place value 0 to L and each position in the right array being filled with a distinctive value, said method consisting of testing that the non-compound token value is less than the unique end-of-file token and if so, omitting the initial primary data unit of said right array.

16. A data compressor as claimed in claim 9 wherein the compressor is arranged to travel along the token store so that tokens exhausted by the compressor are already in their intended destination.

17. A data compressor as claimed in claim 9 wherein the adjustment of the compressor is arranged to be varied automatically so that when highly compressible data is being processed, the stroke of the piston increases so as to compress the tenuous input data stream as early as possible, and when the data is more full of information the stroke is reduced so as to pass the irreducible data through the system as quickly as possible.

18. Apparatus substantially as hereinbefore described with reference to the accompanying drawings.

19. A method of compressing data substantially as hereinbefore described.

20. A data compressor substantially as hereinbefore described with reference to the accompanying drawings.