AN EFFICIENT, LOCALLY-ADAPTIVE DATA REDUCTION METHOD AND APPARATUS
Field of the Invention
The present invention relates to data reduction methods and, in particular, to methods and apparatus for efficiently reducing data in a locally-adaptive manner.
Background of the Invention
Information processing systems and data transmission systems frequently need to store large amounts of digital data in a mass memory device or to transfer large amounts of digital data using a resource, such as a communications channel, that can carry only a limited amount of data at a time. Approaches have therefore been developed to increase the amount of data that can be stored in memory and to increase the information carrying capacity of capacity-limited resources. Most conventional approaches to realizing such increases are costly, because they require the installation of additional resources or the physical improvement of existing resources. Data reduction, in contrast with other conventional approaches, provides such increases without incurring large costs. In particular, it requires neither the installation of additional resources nor the physical improvement of existing resources.
Data reduction methods and apparatus remove redundancy from an input data stream while still preserving the information content. An input data stream can be a stream of data to be transmitted or a file to be compressed, and the input data stream is sometimes referred to as an alphabet A of symbols. The data reduction methods and apparatus of greatest interest are those which are fully reversible, such that an original data stream may be reconstructed from the reduced data without any loss of information content. Techniques, such as filtering, which are not fully reversible are sometimes suitable for reducing the size of visual images or sound data. They are, nevertheless, not suitable for reduction of program image files, textual report files, and the like, because the information content of such files must be preserved exactly. There are two major goals in digital data reduction. The first goal is to maximize reduction by using the fewest possible bits to represent a given quantity of input data. The second goal is to minimize the resources required to perform reduction and reconstruction. The second goal encompasses such objectives as minimizing computation time and minimizing the amount of memory required to reduce and reconstruct the data. Data reduction methods of the prior art typically achieve only one of these goals.
For example, one reduction technique is "move-to-front" coding. The basic idea of this method is to maintain the alphabet A of symbols as a list in which frequently occurring symbols are located near the front. A symbol "s" is encoded as the number of symbols that precede it in this list. Thus, if A=("a", "m", "o", "n", . . .) and the next symbol in the input stream to be encoded is "n", it will be encoded as "3", since it is preceded by three other symbols. This encoding may be supplied to any appropriate standard variable-size bit-encoding scheme, such as Huffman encoding. After symbol "n" is encoded, it is stored (or moved, if it was previously stored) at the front of A. Thus, after encoding "n" the list is modified to A=("n", "a", "m", "o", . . .). This move-to-front step reflects the hope that once "n" has been read from the input stream, it will be read many more times and will, at least for a while, be a common symbol. The move-to-front method is locally adaptive, since it adapts itself to the frequencies of symbols in local areas of the input stream. Unfortunately, this technique consumes a large amount of computational resources when the list A is large, because each symbol in the list must be moved when a new symbol is brought to the front of list A. A variation of move-to-front coding, called "move-ahead-k," attempts to solve the computational efficiency problems of move-to-front coding by moving a list element matched by the current symbol k positions toward the front, instead of all the way to the front of list A. The parameter k can be specified by the user, with a default value of either n or 1. However, this variation does not eliminate moving multiple list elements. The computational load, therefore, may not be significantly lessened.
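The move-to-front scheme described above can be sketched in a few lines of Python (the function name mtf_encode is illustrative only, not part of the described invention):

```python
def mtf_encode(symbols, alphabet):
    """Move-to-front coding: each symbol is encoded as the number of
    symbols that precede it in the list, then moved to the front."""
    alphabet = list(alphabet)  # work on a copy of the list A
    codes = []
    for s in symbols:
        i = alphabet.index(s)                # count of preceding symbols
        codes.append(i)
        alphabet.insert(0, alphabet.pop(i))  # move matched symbol to the front
    return codes
```

With A = ("a", "m", "o", "n"), encoding "n" yields 3 and leaves the list as ("n", "a", "m", "o"), as in the example above; a second "n" then encodes as 0. Note that the pop/insert pair shifts every intervening list element, which is precisely the computational cost discussed above.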
A second variation of move-to-front coding attempts to lessen the computational load by moving an element of the list A to the front only after it has been matched c times to symbols from the input stream (not necessarily c consecutive times). In this variation, known as "wait-c-and-move," each element of A has a counter associated with it to count the number of matches. As above, however, this variation does not avoid shuffling the elements of the list A and the concomitant computational load, but merely delays it.
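Both variations can be sketched similarly (function names and the counter representation here are illustrative assumptions, not taken from the description above):

```python
def move_ahead_k_encode(symbols, alphabet, k=1):
    """Move-ahead-k: the matched element moves only k positions forward."""
    alphabet = list(alphabet)
    codes = []
    for s in symbols:
        i = alphabet.index(s)
        codes.append(i)
        # still shifts up to k intervening elements
        alphabet.insert(max(0, i - k), alphabet.pop(i))
    return codes

def wait_c_encode(symbols, alphabet, c=2):
    """Wait-c-and-move: an element moves to the front only after it has
    been matched c times (not necessarily consecutively)."""
    alphabet = list(alphabet)
    counts = dict.fromkeys(alphabet, 0)  # per-element match counter
    codes = []
    for s in symbols:
        i = alphabet.index(s)
        codes.append(i)
        counts[s] += 1
        if counts[s] >= c:               # delayed move to the front
            alphabet.insert(0, alphabet.pop(i))
            counts[s] = 0
    return codes
```

In both cases the list shuffle remains, which is the residual computational load noted above.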
Summary of the Invention
The present invention provides efficient data reduction systems and apparatus that exhibit the desirable properties of the move-to-front reduction schemes discussed above while avoiding their computational intensity. The present invention therefore provides data reduction techniques and apparatus that operate quickly and with minimal computational load.
In one aspect, the invention relates to an apparatus for efficiently reducing data which includes a token cache, a comparator, and a repositioning mechanism. The token cache stores a plurality of processed tokens. The comparator receives as input a first token to be processed and provides as output an indication that the first token is stored by the cache at a first position. The repositioning mechanism swaps the first token with a second token stored in the token cache. The second token is selected responsive to the position of the first token.
In another aspect, the present invention relates to a reducer which can be used in a system for efficiently reducing data. The reducer includes a token cache, a receiver, a decoder, and a repositioning mechanism. The token cache stores a plurality of processed tokens. The receiver receives token position information. The decoder accepts as input the received position information and provides as output a first token corresponding to the position information. The repositioning mechanism swaps the first token with a second token stored in the token cache. The second token is selected responsive to the received position information.
In yet another aspect, the invention relates to a method for efficiently reducing data. The method includes the steps of determining the position occupied by a first data token in a token cache, selecting a second data token stored in the token cache, and swapping the first token and the second token. The second data token is selected responsive to the position of the first token.
In yet another aspect, the invention relates to a method for receiving transmitted data in a system for efficiently transmitting data. The method includes the steps of receiving a transmitted token position indicator, identifying a first token stored in a token cache using the received token position indicator, determining a second token stored in the token cache based on the position of the first token, and swapping the first token and the second token.
Brief Description of the Drawings
The invention is pointed out with particularity in the appended claims. The advantages of the invention described above, and further advantages, may be better understood by reference to the following description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a block diagram of one embodiment of a data reduction apparatus;
FIG. 2 is a flowchart of one embodiment of the steps taken to reduce data;
FIG. 3A is a logic diagram of one embodiment of a comparator as used in the present invention;
FIG. 3B is a block diagram of an embodiment of token switching logic;
FIG. 4 is a block diagram of an exemplary system using the apparatus and method of the present invention to efficiently transmit data; and
FIG. 5 is a diagram illustrating the components of a general purpose computer.
Detailed Description of the Invention
Throughout the Specification, reference will be made interchangeably to "data tokens" or "symbols". A data token or symbol is any conveniently-sized datum to which the described technique may usefully be applied. Thus, a data token may be a 4-bit nibble, an 8-bit byte, a 16-bit word, a 32-bit longword, or some other conveniently sized datum.
Referring now to FIG. 1, an apparatus 10 for efficiently reducing data includes a token cache 12, a comparator 14, and a repositioning mechanism 16. The token cache 12 stores data tokens. For simplicity, reference throughout will be made to the token cache storing tokens, although it should be understood to include embodiments in which the token cache stores representations of tokens. The comparator 14 compares tokens or symbols from an input stream to the token cache 12, and provides to the repositioning mechanism 16 an indication whether a representation of the current token is stored in the token cache 12. The input stream may be a stream of data tokens to be transmitted, or it may be a file to be reduced. The apparatus outputs a string of encodings which can be used by a decoding unit to reconstruct the reduced information.
Referring also to FIG. 2, the steps taken by the apparatus 10 to efficiently reduce data are shown. A data token to be processed is received (step 102). The data token may be received as one of a stream of tokens to be transmitted over a communications channel, such as a local area network connection, a wide area network connection, or a wireless network connection. In some embodiments, the data token may be accessed from a buffer memory (not shown in FIG. 1) in which received tokens are stored before being processed. The buffer memory may be one or more transceivers embodied as integrated circuits. Alternatively, the data token may be accessed from a file to be reduced using the method of the invention. In general, use of the term "received" is intended to refer to any method of accessing a data token for reduction, whether or not it is buffered before processing.
Once the data token is received (step 102), the comparator 14 determines whether a representation of the received data token is present in the token cache 12 (step 104). In one embodiment, the comparator 14 makes this determination by comparing the received data token with every entry in the token cache 12. In other embodiments, the token cache 12 includes two arrays, one of which stores token encodings and the other of which stores token decodings. An encoding array maps tokens to their respective positions in the output list. A decoding array maps positions in the output list to tokens. In these embodiments, the encoding array may be provided with a flag that indicates whether a token is contained in the token cache 12. The comparator 14 accesses the appropriate element of the encoding array to determine whether the received data token is stored in the token cache and, if so, what position it occupies in the output list. For example, an encoding for token "c" could be stored in the third element of an array. Alternatively, the token encoding could be stored in the 99th element of an array (element 0x63 hexadecimal), which corresponds to the ASCII encoding of "c." When a "c" token is received, the comparator 14 refers to the encoding array to determine if "c" is stored in the token cache; the encoding array may additionally provide the comparator 14 with the position of the token "c".
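The two-array lookup can be sketched as follows (the ABSENT flag value, the array sizes, and the function name are illustrative assumptions):

```python
ABSENT = -1                 # flag value: token not currently in the cache

encode = [ABSENT] * 256     # maps token value -> position in the output list
decode = [None] * 256       # maps position in the output list -> token

# Suppose token "c" (ASCII 0x63, decimal 99) occupies position 2.
encode[ord("c")] = 2
decode[2] = "c"

def position_of(token):
    """Return the token's position in the output list, or None if absent."""
    pos = encode[ord(token)]
    return None if pos == ABSENT else pos
```

In this sketch the encoding array is indexed directly by the token's 8-bit value, so the comparator needs a single array access rather than a scan of the entire cache.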
If a received data token is stored by the token cache 12, the position of the data token in the token cache 12 is determined (step 106). This determination may be made by the comparator 14 when it determines if the data token representation is stored in the token cache 12. Alternatively, a separate functional unit may be provided which independently determines the position of the token representation.
Once the position of the data token is determined, a second data token is selected (step 108) and the received data token is swapped with the second data token stored in the token cache 12 (step 110). Selection of the second token is made responsively to the position of the received data token. For example, the second token may have a position equal to three-quarters, one-half, one-quarter, one-eighth, or one-sixteenth the current position of the received data token. That is, the swap moves the received data token correspondingly closer to the head of the list. Selection of the second data token may also be made in response to current performance characteristics of the data reduction. For example, the apparatus 10 may determine that the data reduction achieved by swapping received tokens with data tokens at one-half their position is not acceptable and can begin using some other rule, such as three-quarters, to attempt to improve performance. Once the tokens are swapped, the next token in the input stream is processed, until no more tokens remain in the stream.
Referring back to step 104, if the comparator 14 determines that a received data token is not stored in the token cache 12, a second data token is selected (step 120). The second data token may be selected using a pseudorandom number generator. In some embodiments, the second data token is selected using the bits from successive numbers in the Fibonacci number sequence. Alternatively, a subset of those bits, such as the bottom three or top five, may be used. In other embodiments, well-known cache management techniques such as most-recently-used (MRU) or least-recently-used (LRU) may be used to select the second data token. The selected second data token is replaced by the received data token (step 122), and the next token in the input stream is processed until no more tokens remain in the stream.
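The position-based swap of steps 108-110 might be sketched as follows (the fraction parameter generalizing the one-half rule, and the function name, are illustrative):

```python
def swap_toward_front(cache, i, fraction=0.5):
    """Swap the token at position i with the token at fraction * i,
    moving the received token closer to the head of the list."""
    j = int(i * fraction)              # e.g. one-half the current position
    cache[i], cache[j] = cache[j], cache[i]
    return j                           # the token's new position

cache = list("abcdefgh")
new_pos = swap_toward_front(cache, 6)  # token "g" moves from position 6 to 3
```

Unlike move-to-front, only two elements change position, so the cost per token is constant regardless of the size of the cache.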
Referring back to FIG. 1, and in greater detail, the apparatus 10 for performing the methods described above may be provided as hardware or software executing on a general-purpose computer. The token cache 12 stores received data tokens and may be implemented as any convenient memory element or memory data structure. For example, if tokens are 8-bit bytes, the token cache 12 may be implemented as a byte-wide memory chip such as SRAM, DRAM, SDRAM, or flash memory. Alternatively, the token cache 12 may be implemented as an array memory structure in which data tokens are stored. In these embodiments, the array memory structure matches the size of the tokens. In one advantageous embodiment, the token cache 12 is implemented as two arrays. One array stores encodings of data tokens, that is, the array maps tokens to their position in the output list. A second array, which corresponds to the first array, stores the decodings of positions, that is, it maps positions in an output list to the corresponding token value.
As noted above, the comparator 14 compares received data tokens to the token cache 12. In software, the comparison of the received data token to the token cache 12 is effected by comparing the received data token to every element in the token cache 12. FIG. 3A depicts an embodiment of the comparator 14 which combinatorially compares a received data token to the token cache 12. In the embodiment shown in FIG. 3A, for simplicity, a 4-bit nibble is shown as the token size and only the block of logic 30 required to compare one entry in the token cache with a received data token is shown.
In the embodiment shown in FIG. 3A, a received data token is stored in a buffer element 32. Circuitry 34 compares each bit of the received data token with the corresponding bit of an element in the token cache. In the embodiment depicted in FIG. 3A, comparison circuitry 34 includes two AND gates 35, 36 and an OR gate 37. The bits to be compared are delivered to the inputs of AND gate 35. The output of this gate is high only when both bits are equal to a logical "1" value. The inversion of each bit is delivered to the inputs of AND gate 36. The output of AND gate 36 is high only when both bits are a logic "0". The outputs of AND gates 35, 36 are connected to the inputs of OR gate 37, which outputs a logic "1" if the bits to be compared are both "0" or both "1". The result of each individual comparison, that is, the output of each OR gate 37, is combined with the results of the other OR gates in the logic block 30. The output of AND gate 38 is a logic "1" only when each and every output of the OR gates 37 is a logic "1". A logic "1" output from AND gate 38 indicates that the received data token stored in buffer memory element 32 matches a token stored in the token cache 12. In the embodiment shown in FIG. 3A, the outputs of the respective AND gates 38 can be used to determine the position of the matching data token. Although only one logic block, which compares one token with a single received data token, is shown in FIG. 3A, it will be readily apparent to one of ordinary skill in the art how to extend the embodiment shown to accommodate an entire token cache or different token sizes.
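The gate-level comparison may be modeled in software as follows (a sketch of the logic block described above, assuming the 4-bit token width of the figure; the function name is illustrative):

```python
def tokens_match(received, cached, width=4):
    """Model of logic block 30: per-bit AND/AND/OR comparison (an XNOR)
    feeding a final AND gate that is 1 only when every bit pair matches."""
    result = 1
    for i in range(width):
        a = (received >> i) & 1
        b = (cached >> i) & 1
        both_one = a & b                 # AND gate 35: both bits are 1
        both_zero = (a ^ 1) & (b ^ 1)    # AND gate 36: both bits are 0
        result &= both_one | both_zero   # OR gate 37 into AND gate 38
    return result
```

One such comparison per cache entry, evaluated in parallel in hardware, yields the match indication and position described above.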
As noted above, if the received data token is stored in the token cache 12, a second data token is selected and the positions of those tokens are swapped. FIG. 3B shows a block diagram of a hardware implementation for swapping two data tokens in the token cache 12. In the embodiment shown in FIG. 3B, the token cache 12 is implemented as a plurality of token-wide latches that can be simultaneously, selectively read. The outputs of the token cache 12 are fed back to the inputs of the token cache 12 through a crossbar switching element 39. The crossbar switching logic 39 allows any output of the token cache 12 to be fed to any input of the token cache 12. This implementation allows data tokens to be swapped in one clock cycle. In another implementation, the token cache 12 is provided as token-wide RAM memory elements. In this embodiment, each token cache location is read and the output is stored in a latch element. After both entries are read, they are written back to the memory element using the address of the other token entry. The addresses may be latched to associate them with the token entry. In other embodiments, the token cache 12 may be provided as dual-port RAM to allow the token cache 12 to be written and read at the same time.
For embodiments in which the received data token is not stored in the token cache 12, the circuitry described above in relation to FIG. 3B may still be used, provided that the selected token to be removed from the token cache 12 is not fed back to the inputs of the cache 12.
EXAMPLE
The following example illustrates one way in which the invention can be used and should not be read to unduly limit the invention.
Referring now to FIG. 4, a system 40 for transmitting reduced data is shown and includes a transmitter 42 and a receiver 44. The transmitter 42 and the receiver 44 communicate over a communications channel 46 that may be a local area network connection or a wide area network connection. Communications channel 46 may use any suitable communications protocol such as
Ethernet, TCP/IP, or ATM. Channel 46 may also be a wireless connection.
The transmitter 42 includes a token cache 12, a comparator 14, and a repositioning mechanism 16. In this example, the token cache 12 is provided as two 256-byte arrays. One of the arrays will be referred to as Encode and the other as Decode, and their entries satisfy the property:
Decode[Encode[i]] = i for all 0 <= i <= 255. The Encode array and the Decode array may be initialized such that Encode[i] == i and Decode[i] == i for all elements i.
When a data token T is received by the comparator 14, whether read from a file or received over a communications channel, the comparator 14 accesses the Encode array to determine the current encoding for token T. That encoding is provided to transceiver 48, which transmits the encoding over the communications channel 46. The comparator 14 determines with which token the received data token should be swapped, and the repositioning mechanism
16 performs the swap. In software, the actions of the transmitter may be modeled as follows:

    e = Encode[T]         /* What is the current encoding for token T? */
    Output[e]             /* Transmit the encoding */
    half_e = e/2          /* What encoding should token T have next time? */
    X = Decode[half_e]    /* What token currently has the future encoding of T? */
    Decode[e] = X         /* Swap the two tokens */
    Decode[half_e] = T
    Encode[X] = e
    Encode[T] = half_e

The receiver 44 reconstructs T from e using its own copy of the token cache 12, constructed as two arrays, Encode2 and Decode2 (not shown in FIG. 4). The receiver 44 performs the following steps:

    Receive[e]            /* Receive the encoding */
    T = Decode2[e]        /* Decode e to recover token T */
    half_e = e/2          /* What encoding should token T have next time? */
    X = Decode2[half_e]   /* What token currently has the future encoding of T? */
    Decode2[e] = X        /* Swap the two tokens */
    Decode2[half_e] = T
    Encode2[X] = e
    Encode2[T] = half_e
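The transmitter and receiver steps above can be sketched as a runnable round trip (array and function names mirror the pseudocode; this is an illustration under those assumptions, not the only possible implementation):

```python
def make_cache():
    """Encode and Decode arrays initialized so Decode[Encode[i]] == i."""
    return list(range(256)), list(range(256))

def reduce_token(T, Encode, Decode):
    """Transmitter side: emit the current encoding of T, then swap T
    toward the front by giving it the encoding e // 2."""
    e = Encode[T]
    half_e = e // 2
    X = Decode[half_e]                  # token holding T's future encoding
    Decode[e], Decode[half_e] = X, T    # swap the two tokens
    Encode[X], Encode[T] = e, half_e
    return e

def reconstruct_token(e, Encode2, Decode2):
    """Receiver side: recover T from e, then perform the identical swap
    so the two caches stay synchronized."""
    T = Decode2[e]
    half_e = e // 2
    X = Decode2[half_e]
    Decode2[e], Decode2[half_e] = X, T
    Encode2[X], Encode2[T] = e, half_e
    return T

# Round trip: the receiver reconstructs the original stream exactly.
Encode, Decode = make_cache()
Encode2, Decode2 = make_cache()
data = [ord(c) for c in "banana band"]
encodings = [reduce_token(T, Encode, Decode) for T in data]
recovered = [reconstruct_token(e, Encode2, Decode2) for e in encodings]
```

Because both sides start from identical caches and apply the same swap for every transmitted encoding, repeated tokens receive progressively smaller encodings while the receiver remains in lockstep with the transmitter.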
For embodiments in which the invention is provided as software, the program may be written in any one of a number of high level languages such as FORTRAN, PASCAL, JAVA, C, C++, or BASIC. Additionally, the software could be implemented in an assembly language directed to the microprocessor resident on the target computer; for example, the software could be implemented in Intel 80x86 assembly language if it were configured to run on an IBM PC or PC clone. The software may be embodied on an article of manufacture including, but not limited to, a floppy disk, a hard disk, an optical disk, a magnetic tape, a PROM, an EPROM, an EEPROM, a field-programmable gate array, or a CD-ROM. In these embodiments, the software may be configured to run on any personal-type computer or workstation such as a PC or PC-compatible machine, an Apple Macintosh, a Sun workstation, etc. In general, any device could be used as long as it is able to perform all of the functions and capabilities described herein. The particular type of computer or workstation is not central to the invention.
Referring to FIG. 5, the computer 500 typically will include a central processor 520, a main memory unit 522 for storing programs and/or data, an input/output (I/O) controller 524, a display device 526, and a data bus 528 coupling these components to allow communication therebetween. The memory 522 includes random access memory (RAM) and read only memory (ROM) chips. The computer 500 typically also has one or more input devices 530 such as a keyboard 532 (e.g., an alphanumeric keyboard and/or a musical keyboard), a mouse 534, and, in some embodiments, a joystick 536.
The computer 500 typically also has a hard drive 550 with hard disks therein and a floppy drive 552 for receiving floppy disks such as 3.5 inch disks. Other devices 560 also can be part of the computer 500 including output devices (e.g., printer or plotter) and/or optical disk drives for receiving and reading digital data on a CD-ROM. In the disclosed embodiment, one or more computer programs define the operational capabilities of the system 500, as mentioned previously. These programs can be loaded onto the hard drive 550 and/or into the memory 522 of the computer 500 via the floppy drive 552. In general, the controlling software program(s)
and all of the data utilized by the program(s) are stored on one or more of the computer's storage media, such as the hard drive 550, CD-ROM, etc. In general, the programs implement the invention on the computer 500, and the programs either contain or access the data needed to implement all of the functionality of the invention on the computer 500.
Having described certain embodiments of the invention, it will now become apparent to one of ordinary skill in the art that other embodiments incorporating the concepts of the invention may be used. Therefore, the invention should not be limited to certain embodiments, but rather should be limited only by the spirit and scope of the following claims.