WO1994027374A1

WO1994027374A1 - Method and apparatus for efficient compression of data having redundant characteristics

Info

Publication number: WO1994027374A1
Application number: PCT/US1994/005320
Authority: WO
Inventors: Ke-Chiang Chu; Daniel J. Culbert
Original assignee: Apple Computer, Inc.
Priority date: 1993-05-13
Filing date: 1994-05-13
Publication date: 1994-11-24
Also published as: AU6911894A

Abstract

A method and apparatus for compressing inherently redundant data. A Unicode file is comprised of prefix group indicator bytes and suffix character indicator bytes and can therefore be separated into two files, one containing the prefixes and one containing the suffix characters. Then, each separate file can be separately compressed using means best suited to the characteristics of each. Because of the high degree of redundancy across the prefix group indicator bytes they can be more greatly compressed which in turn results in greater compression of the entire Unicode file. Multiple compression methodologies, equally applicable to any inherently redundant data file, can be applied to the prefix group indicator bytes to yield the best compression results.

Description

METHOD AND APPARATUS FOR EFFICIENT COMPRESSION OF DATA HAVING REDUNDANT CHARACTERISTICS

FIELD OF THE INVENTION

The present invention relates generally to the field of data compression, and more particularly to compression of data having inherently redundant characteristics.

BACKGROUND OF THE INVENTION

In the field of data processing, an 8 bit byte is the traditional unit of computer data. Typically, individual characters in a file or data set are separately denoted and stored as single bytes. Commonly known and used single-byte character formats include American Standard Code for Information Interchange (ASCII) and Extended Binary Coded Decimal Interchange Code (EBCDIC).

There are, however, problems which have arisen due to limitations of the 8 bit byte character format. First of all, with only 8 bits per character, there can be only 256 (2 to the 8th power) different characters represented. While 256 characters is generally sufficient for languages such as English, it is generally insufficient for other languages such as Kanji or Chinese. Secondly, 8 bits (256 characters) is generally insufficient to represent a combined language environment, such as English characters intermixed with math symbols and/or control characters.

Furthermore, software vendors have been forced to make 'localized software' when distributing software in multiple countries because 256 characters is generally insufficient to support all of the different characters needed for all of the different languages of those countries. Completion, maintenance and support of localized software can be a tremendous undertaking. Thus, the typical single byte character formats are inadequate in an increasingly complex global computing environment.

For these reasons, a new character format or standard has emerged known as Unicode. As is well known and is explained in "The Unicode Standard, Worldwide Character Encoding" Version 1.0, Volume One, Copyright 1990, 1991 Unicode, Inc., Unicode is a fixed-width, uniform text and character encoding scheme utilizing a 16-bit architecture which extends the benefits of ASCII to multilingual text. Unicode characters are consistently 16 bits wide, regardless of language, so no escape sequence or control code is generally required to specify any character in any language. Unicode character encoding treats symbols, alphabetic characters, and ideographic characters identically, so that they can be used simultaneously and with equal facility.

Because there are 16 bits per character, it is possible to represent up to 65,536 (2 to the 16th power) different characters with Unicode. The Unicode standard currently contains over 28,000 characters, including 2,300 general (alphabetic or syllabic) letters, 1,200 textual symbols, and 3,300 CJK (Chinese/Japanese/Korean) phonetics, punctuation, symbols, Korean Hangul syllables and over 20,000 Han characters.

The Unicode format, as stated above, utilizes 16 bits for each character represented. Referring now to Figure 1, the format of a single generic Unicode character of 16 bits is shown. The first half (first 8 bits), or prefix, of each character represented in Unicode is an indicator of the group (e.g., math symbol, Kanji, English, etc.) of the particular character being represented. The second half (second 8 bits), or suffix, of each character represented in Unicode indicates which particular character within the indicated group is being represented.
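As an illustration only, the prefix/suffix division of a single 16-bit character can be sketched in Python. The function name `split_character` is hypothetical and not part of the specification, and the sketch assumes every character fits in 16 bits, as in the format described above.

```python
def split_character(ch: str):
    """Return (prefix, suffix) bytes of one 16-bit character.

    Assumption: the character's code fits in 16 bits, as in the
    two-byte-per-character format described in the text.
    """
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("character does not fit in 16 bits")
    prefix = code >> 8      # first 8 bits: group indicator
    suffix = code & 0xFF    # second 8 bits: character within the group
    return prefix, suffix


latin_a = split_character("A")           # (0x00, 0x41)
almost_equal = split_character("\u2248")  # math symbol, (0x22, 0x48)
```

Here the Latin letter and the math symbol fall in different groups, so their prefix bytes differ while each suffix selects the character within its group.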

Unicode can therefore more easily represent a variety of characters in a single document or file without requiring specialized or localized software. However, the storage overhead of Unicode data is, by definition, larger than with 8-bit character formats because Unicode data uses 16 bits per character. Thus, documents or data files stored in the Unicode format are generally twice as large as would be the same documents or data files represented in ASCII, for example. There is therefore a need to reduce the increased size of Unicode files while still retaining the ability to represent the range of characters supported by the Unicode character format.

Typical compression methodologies handle uncompressed data on a byte-by-byte basis. Compressing data on a byte-by-byte basis generally works well for data which is comprised of characters stored in a single byte per character format. Referring now to Figure 5, an example compression method which is well known in the art processes an uncompressed input data stream 10 to generate a compressed data output stream 20 by comparing an uncompressed portion 13 of input data stream 10 to data in a history buffer 11 of already processed input data. If a matching data string 12 is located in history buffer 11 for current data string 14, data string 14 is encoded in compressed data stream 20 as a pointer (p₀, l₀) 24, corresponding to an offset p₀ 15 and a data length l₀ 16. The shorter length data of pointer (p₀, l₀) 24 thus replaces longer data string 14 in output compressed data stream 20.
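The history-buffer search described above can be sketched as follows. This is a minimal brute-force illustration in Python, not the specific prior-art method of Figure 5; the function name `find_longest_match` and the `window` and `min_len` parameters are assumptions introduced for the example.

```python
def find_longest_match(data: bytes, pos: int, window: int = 4096, min_len: int = 3):
    """Search the history buffer (bytes before pos) for the longest
    string matching the data that starts at pos.

    Returns (offset, length), where offset counts back from pos,
    or None when no match of at least min_len bytes exists.
    """
    start = max(0, pos - window)
    best_offset, best_len = 0, 0
    for cand in range(start, pos):
        length = 0
        # Extend the match byte by byte; it may run past pos (overlap).
        while pos + length < len(data) and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_offset, best_len = pos - cand, length
    if best_len >= min_len:
        return best_offset, best_len
    return None


pointer = find_longest_match(b"abcdef--abcdef", 8)  # (8, 6)
```

The returned pair plays the role of the pointer (p₀, l₀): two small numbers stand in for the six matched bytes.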

Unfortunately, such prior art compression approaches do not work as well with Unicode data because each character in the Unicode format is comprised of two bytes. One problem is the greater time needed to compress Unicode data. This is because Unicode data comprises suffix character data interspersed with prefix group indicators and hence, in general, more bytes have to be scanned in order to find each match.

Another problem is the general doubling of the resulting length (l) and offset (p) values relative to what the equivalent match would have produced in a non-Unicode data format. Doubling the value of either the length (l) or the offset (p) results in a decreased compression ratio, an undesirable side effect.
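The doubling can be seen with a small Python experiment. The sample sentence and the helper `back_reference` are hypothetical, and Python's `utf-16-be` codec is used here as a stand-in for the two-byte-per-character format (prefix byte first).

```python
text = "compress this, then compress that"
word = "compress"

ascii_bytes = text.encode("ascii")        # one byte per character
unicode_bytes = text.encode("utf-16-be")  # two bytes per character, no byte-order mark


def back_reference(data: bytes, needle: bytes):
    """Return (offset, length) linking the second occurrence of needle
    back to the first occurrence."""
    first = data.find(needle)
    second = data.find(needle, first + 1)
    return second - first, len(needle)


ascii_ref = back_reference(ascii_bytes, word.encode("ascii"))          # (20, 8)
unicode_ref = back_reference(unicode_bytes, word.encode("utf-16-be"))  # (40, 16)
```

The same repeated word yields an offset and a length exactly twice as large in the two-byte format, even though the text is identical.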

A still further problem is the increased difficulty in finding matching strings when the current data string to be matched occurs at a break between a prefix group indicator and its associated suffix character due to the previous matching string. In that situation, one is no longer merely trying to match a character and its associated prefix group code with an earlier character and its associated prefix group code. Instead one is trying to match a character and a following prefix group code with an earlier occurrence of the same character and the same following prefix group code. There is less likelihood of finding such a match and so this too results in a decreased compression ratio. Thus, an improved compression methodology is needed to handle the larger data files of the two-byte-per-character Unicode format.

SUMMARY AND OBJECTS OF THE INVENTION

An objective of the present invention is to provide an improved method and apparatus for efficient compression of data.

Another objective of the present invention is to provide an improved method and apparatus for efficient compression of data having redundant characteristics.

A still further objective of the present invention is to provide an improved method and apparatus for efficient compression of data stored in a Unicode character format.

The foregoing and other advantages are provided by a method for compressing Unicode data comprising separating the prefix group indicator portion of the Unicode data from the suffix character portion of the Unicode data and separately compressing the prefix group indicator portion of the Unicode data and the suffix character portion of the Unicode data.

The foregoing and other advantages are further provided by a compression method wherein the prefix group indicator portion of the Unicode data is separately compressed by different compression methodologies and then using the compression methodology which yielded the best compression ratio with the prefix group indicator portion of the Unicode data.

The foregoing and other advantages are also provided by an apparatus for compressing Unicode data comprising processor means for separating the prefix group indicator portion of the Unicode data from the suffix character portion of the Unicode data, processor means for separately compressing the prefix group indicator portion of the Unicode data and the suffix character portion of the Unicode data, and memory means for storing the compressed prefix group indicator portion of the Unicode data and the compressed suffix character portion of the Unicode data.

One advantage of the present invention is improved compression/decompression of Unicode data. Another advantage of the present invention is improved compression/decompression of data stored in a two-byte-per-character format.

Still another advantage of the present invention is improved compression/decompression of data which is inherently redundant.

Other objects, features and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description which follows.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which:

Fig. 1 depicts the format of a single generic Unicode character;

Fig. 2 is a generalized block diagram of a typical computer system which might utilize the present invention;

Fig. 3 depicts a typical sequence of characters stored in the two-byte-per-character Unicode format;

Fig. 4 is a block diagram of the compression and decompression approach of the present invention; and

Fig. 5 depicts an example compression and decompression approach of the prior art.

DETAILED DESCRIPTION OF THE INVENTION

Figure 2 is a generalized block diagram of a typical computer system 210 which might utilize the present invention. Computer system 210 includes a CPU/memory unit 211 that generally comprises a microprocessor, related logic circuitry, and memory circuitry. Input device 213, which by way of example can be a keyboard, a mouse, a trackball, a joystick, a stylus, a touch screen, a touch tablet, etc., or any combination thereof, provides input to the CPU/memory unit 211. External storage 217, which can include fixed disk drives, floppy disk drives, memory cards, etc., is used for mass storage of programs and data. Display output is provided by display 219, which by way of example can be a video display or a liquid crystal display. Note that for some configurations of computer system 210, input device 213 and display 219 may be one and the same, e.g., display 219 may also be a tablet which can be pressed or written on for input purposes.

As has been explained, the Unicode standard stores data in a two-byte-per-character format. And because traditional compression methods do not properly handle data stored in multiple bytes-per-character, the increased file size of Unicode data competes with the benefits of using the Unicode format. The present invention overcomes the limitations of the traditional compression methods while still supporting the benefits gained through use of the Unicode standard.

Referring now to Figure 3, a typical sequence of two-byte characters stored in a Unicode format can be seen. Bytes 1, 3 and 5 of the character sequence are the prefix group indicator bytes while bytes 2, 4 and 6 are the suffix characters themselves (see discussion with reference to Figure 1).

It is important to note here that while the prefix group indicator bytes 1, 3 and 5 of Figure 3 could be different from each other, because most documents tend to use one language there tends to be a high degree of redundancy in the group indicator bytes within a single document or file stored in the Unicode format. In other words, because a single document tends to be written primarily in a single language (e.g., English, Japanese, etc.) with a fewer number of other types of characters (e.g., control, mathematics symbols, etc.) intermixed, there tends to be a high degree of redundancy across the group indicator bytes within a file. This inherent redundancy of Unicode data (where generally every other byte is a group indicator byte) can be greatly utilized to improve compression of such data, as will be explained more fully herein.
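The redundancy of the group indicator bytes can be made concrete with a short Python sketch. The helper name `prefix_histogram` and the sample string are assumptions for illustration, and `utf-16-be` again stands in for the two-byte-per-character format.

```python
from collections import Counter


def prefix_histogram(unicode_bytes: bytes) -> Counter:
    """Count how often each prefix (every other byte, starting with
    the first) occurs in two-byte-per-character data."""
    return Counter(unicode_bytes[0::2])


# Mostly English text with a few symbols mixed in: "area ≈ π r²".
sample = "area \u2248 \u03c0 r\u00b2".encode("utf-16-be")
hist = prefix_histogram(sample)

dominant, count = hist.most_common(1)[0]
share = count / sum(hist.values())
print(f"prefix 0x{dominant:02X} accounts for {count} of {sum(hist.values())} characters")
```

Even in this short, deliberately mixed sample a single prefix byte dominates; in a longer document written mainly in one language the imbalance would typically be stronger.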

Referring now to Figure 4, the compression methodology of the preferred embodiment of the present invention will now be explained. Uncompressed Unicode data 401 is first separated into two blocks or files: one block or file containing the inherently redundant prefix group indicator bytes 405 without the suffix character indicating bytes, and one block or file containing the suffix character indicating bytes 407 without the prefix group indicator bytes. This is a simple process in the preferred embodiment since these are merely alternating bytes within the incoming uncompressed Unicode data.
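Because the two kinds of bytes simply alternate, the separation step can be sketched in a few lines of Python. The function name `split_unicode` is hypothetical, and the input is assumed to hold whole two-byte characters with the prefix byte first.

```python
from typing import Tuple


def split_unicode(data: bytes) -> Tuple[bytes, bytes]:
    """Separate two-byte-per-character data into a block of prefix
    group indicator bytes and a block of suffix character bytes."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes (two per character)")
    prefixes = data[0::2]  # bytes 1, 3, 5, ... of the stream
    suffixes = data[1::2]  # bytes 2, 4, 6, ... of the stream
    return prefixes, suffixes


prefixes, suffixes = split_unicode("Hi!".encode("utf-16-be"))
# prefixes == b"\x00\x00\x00", suffixes == b"Hi!"
```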

The file containing the suffix character bytes 407 can be compressed 411 using any typical data compression method and results in compressed character file or block 415.

In the preferred embodiment of the present invention, the file containing the inherently redundant prefix group indicator bytes 405 is compressed using one of a number of compression methodologies and results in compressed prefix file or block 413. One compression technique used in the method of the present invention is a run-length encoding methodology wherein sequences of repeated bytes are replaced with a count of the number of instances of that byte being repeated in the prefix group indicator portion of the Unicode data. A second compression technique used in the method of the present invention is a master+exception list methodology wherein one byte of the prefix group indicator portion of the Unicode data is chosen as a master byte and each exception to the master byte in the prefix group indicator portion of the Unicode data is noted.
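The two techniques named above can be sketched in Python as follows. The specification does not fix a byte layout for either one, so the layouts used here are assumptions made for the example: (count, byte) pairs with counts capped at 255 for run-length encoding, and a header of total length plus master byte followed by (position, byte) records for the master+exception list. All function names are hypothetical.

```python
import struct
from collections import Counter


def rle_encode(data: bytes) -> bytes:
    """Run-length encode as (count, byte) pairs; count is at most 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(blob: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(blob), 2):
        out += bytes([blob[i + 1]]) * blob[i]
    return bytes(out)


def master_encode(data: bytes) -> bytes:
    """Store the most common byte once, then list each exception."""
    master = Counter(data).most_common(1)[0][0] if data else 0
    out = bytearray(struct.pack(">IB", len(data), master))  # header: length, master
    for pos, byte in enumerate(data):
        if byte != master:
            out += struct.pack(">IB", pos, byte)            # exception: position, byte
    return bytes(out)


def master_decode(blob: bytes) -> bytes:
    length, master = struct.unpack_from(">IB", blob, 0)
    out = bytearray([master]) * length
    for off in range(5, len(blob), 5):
        pos, byte = struct.unpack_from(">IB", blob, off)
        out[pos] = byte
    return bytes(out)


# Twelve prefix bytes: mostly group 0x00 with one math symbol (0x22).
prefixes = bytes([0x00] * 6 + [0x22] + [0x00] * 5)
rle_blob = rle_encode(prefixes)        # 6 bytes
master_blob = master_encode(prefixes)  # 10 bytes
```

Which layout is smaller depends on the data: long unbroken runs favor run-length encoding, while a few scattered exceptions in a very long block may favor the master+exception list.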

Note that compressed prefix file 413 and compressed character file 415 are stored as a single combined compressed Unicode file in the preferred embodiment of the present invention. Note that decompression is provided by merely reversing the compression process, as is indicated in Fig. 4.
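One possible shape for the combined file and for the reversed process is sketched below in Python. The container layout (a one-byte method identifier, a four-byte length, the compressed prefix block, then the compressed suffix block) is an assumption made for this example, as is the use of `zlib` as the "typical" compressor for the suffix bytes. The sketch also tries several prefix methods and keeps the smallest result, in the manner the summary describes; the method table and all function names are hypothetical.

```python
import struct
import zlib


def rle_encode(data: bytes) -> bytes:
    """Run-length encode as (count, byte) pairs; count is at most 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(blob: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(blob), 2):
        out += bytes([blob[i + 1]]) * blob[i]
    return bytes(out)


# Hypothetical method table: identifier -> (encode, decode).
PREFIX_METHODS = {
    0: (bytes, bytes),                      # stored unchanged
    1: (rle_encode, rle_decode),            # run-length encoding
    2: (zlib.compress, zlib.decompress),    # general-purpose compressor
}


def compress_unicode(data: bytes) -> bytes:
    if len(data) % 2:
        raise ValueError("expected two bytes per character")
    prefixes, suffixes = data[0::2], data[1::2]
    # Try every prefix method and keep whichever output is smallest.
    method, packed = min(
        ((mid, enc(prefixes)) for mid, (enc, _) in PREFIX_METHODS.items()),
        key=lambda item: len(item[1]),
    )
    header = struct.pack(">BI", method, len(packed))
    return header + packed + zlib.compress(suffixes)


def decompress_unicode(blob: bytes) -> bytes:
    method, size = struct.unpack_from(">BI", blob, 0)
    prefixes = PREFIX_METHODS[method][1](blob[5:5 + size])
    suffixes = zlib.decompress(blob[5 + size:])
    out = bytearray(len(prefixes) * 2)
    out[0::2] = prefixes   # re-interleave: prefix bytes first
    out[1::2] = suffixes
    return bytes(out)


original = ("The quick brown fox. " * 20).encode("utf-16-be")
combined = compress_unicode(original)
restored = decompress_unicode(combined)
```

For this all-English sample every prefix byte is identical, so run-length encoding produces the smallest prefix block and is the method recorded in the header.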

In the foregoing specification, the invention has been described with reference to a specific exemplary embodiment and alternative embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims

What is claimed is:

1. A method for compressing Unicode data comprising:

a) separating the prefix group indicator portion of the Unicode data from the suffix character portion of the Unicode data; and

b) separately compressing the prefix group indicator portion of the Unicode data and the suffix character portion of the Unicode data.

2. The compression method of Claim 1 wherein the prefix group indicator portion of the Unicode data is compressed using a run-length encoding methodology wherein sequences of repeated bytes are replaced with a count of the number of instances of that byte being repeated in the prefix group indicator portion of the Unicode data.

3. The compression method of Claim 1 wherein the prefix group indicator portion of the Unicode data is compressed using a master+exception list methodology wherein one byte of the prefix group indicator portion of the Unicode data is chosen as a master byte and each exception to the master byte in the prefix group indicator portion of the Unicode data is noted.

4. The compression method of Claim 1 wherein the prefix group indicator portion of the Unicode data is compressed by:

i) separately compressing the prefix group indicator portion of the Unicode data by different compression methodologies; and

ii) using the compression methodology of the different compression methodologies in (i) which yielded the best compression ratio with the compressed prefix group indicator portion of the Unicode data.

5. A method for decompressing compressed Unicode data comprising:

a) separately decompressing the compressed prefix group indicator portion of the Unicode data and the compressed suffix character portion of the Unicode data; and

b) joining the decompressed prefix group indicator portion of the Unicode data with the decompressed suffix character portion of the Unicode data.

6. An apparatus for compressing Unicode data comprising:

a) processor means for separating the prefix group indicator portion of the Unicode data from the suffix character portion of the Unicode data;

b) processor means for separately compressing the prefix group indicator portion of the Unicode data and the suffix character portion of the Unicode data; and

c) memory means for storing the compressed prefix group indicator portion of the Unicode data and the compressed suffix character portion of the Unicode data.

7. An apparatus for decompressing Unicode data comprising:

a) processor means for separately decompressing the compressed prefix group indicator portion of the Unicode data and the compressed suffix character portion of the Unicode data;

b) processor means for joining the uncompressed prefix group indicator portion of the Unicode data with the uncompressed suffix character portion of the Unicode data; and

c) memory means for storing the uncompressed Unicode data.