WO1994027374A1 - Method and apparatus for efficient compression of data having redundant characteristics - Google Patents

Method and apparatus for efficient compression of data having redundant characteristics

Info

Publication number
WO1994027374A1
Authority
WO
WIPO (PCT)
Prior art keywords
unicode data
data
unicode
group indicator
prefix group
Prior art date
Application number
PCT/US1994/005320
Other languages
French (fr)
Inventor
Ke-Chiang Chu
Daniel J. Culbert
Original Assignee
Apple Computer, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apple Computer, Inc.
Priority to AU69118/94A
Publication of WO1994027374A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/46 Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind
    • H03M7/48 Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind alternating with other codes during the code conversion process, e.g. run-length coding being performed only as long as sufficiently long runs of digits of the same kind are present

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method and apparatus for compressing inherently redundant data. A Unicode file is comprised of prefix group indicator bytes and suffix character indicator bytes and can therefore be separated into two files, one containing the prefixes and one containing the suffix characters. Then, each separate file can be separately compressed using means best suited to the characteristics of each. Because of the high degree of redundancy across the prefix group indicator bytes they can be more greatly compressed which in turn results in greater compression of the entire Unicode file. Multiple compression methodologies, equally applicable to any inherently redundant data file, can be applied to the prefix group indicator bytes to yield the best compression results.

Description

METHOD AND APPARATUS FOR EFFICIENT COMPRESSION OF DATA HAVING REDUNDANT CHARACTERISTICS
FIELD OF THE INVENTION
The present invention relates generally to the field of data compression, and more particularly to compression of data having inherently redundant characteristics.
BACKGROUND OF THE INVENTION
In the field of data processing, an 8-bit byte is the traditional unit of computer data. Typically, individual characters in a file or data set are separately denoted and stored as single bytes. Commonly known and used single-byte character formats include American Standard Code for Information Interchange (ASCII) and Extended Binary Coded Decimal Interchange Code (EBCDIC).
There are, however, problems or difficulties which have arisen due to limitations of the 8-bit byte character format. First of all, with only 8 bits per character, there can be only 256 (2 to the 8th power) different characters represented. While 256 characters is generally sufficient for languages such as English, it is generally insufficient for other languages such as Japanese (written in part with Kanji) or Chinese. Secondly, 8 bits (256 characters) is generally insufficient to represent a combined language environment, such as English characters intermixed with math symbols and/or control characters.
Furthermore, software vendors have been forced to make 'localized software' when distributing software in multiple countries because 256 characters is generally insufficient to support all of the different characters needed for all of the different languages of those countries. Completion, maintenance and support of localized software can be a tremendous undertaking. Thus, the typical single byte character formats are inadequate in an increasingly complex global computing environment.
For these reasons, a new character format or standard has emerged known as Unicode. As is well known and is explained in "The Unicode Standard, Worldwide Character Encoding" Version 1.0, Volume One, Copyright 1990, 1991 Unicode, Inc., Unicode is a fixed-width, uniform text and character encoding scheme utilizing a 16-bit architecture which extends the benefits of ASCII to multilingual text. Unicode characters are consistently 16 bits wide, regardless of language, so no escape sequence or control code is generally required to specify any character in any language. Unicode character encoding treats symbols, alphabetic characters, and ideographic characters identically, so that they can be used simultaneously and with equal facility.
Because there are 16 bits per character, it is possible to represent up to 65,536 (2 to the 16th power) different characters with Unicode. The Unicode standard currently contains over 28,000 characters, including 2,300 general (alphabetic or syllabic) letters, 1,200 textual symbols, and 3,300 CJK (Chinese/Japanese/Korean) phonetics, punctuation, symbols, Korean Hangul syllables and over 20,000 Han characters.
The Unicode format, as stated above, utilizes 16 bits for each character represented. Referring now to Figure 1, the format of a single generic Unicode character of 16 bits is shown. The first half (first 8 bits), or prefix, of each character represented in Unicode is an indicator of the group (e.g., math symbol, Kanji, English, etc.) of the particular character being represented. The second half (second 8 bits), or suffix, of each character represented in Unicode indicates which particular character within the indicated group is being represented.
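As an illustrative sketch only (it is not part of the specification), the prefix and suffix of one 16-bit character can be obtained with a shift and a mask. The function name split_char is hypothetical, and the sketch assumes the prefix occupies the high-order 8 bits, as the description of Figure 1 suggests.

```python
def split_char(code_unit: int) -> tuple[int, int]:
    """Split a 16-bit character value into (prefix, suffix) bytes."""
    if not 0 <= code_unit <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    prefix = code_unit >> 8    # first 8 bits: group indicator
    suffix = code_unit & 0xFF  # second 8 bits: character within the group
    return prefix, suffix


print(split_char(0x0041))  # Latin capital A: prefix 0x00, suffix 0x41
print(split_char(0x03A9))  # Greek capital omega: prefix 0x03, suffix 0xA9
```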
Unicode can therefore more easily represent a variety of characters in a single document or file without requiring specialized or localized software. However, the storage overhead of Unicode data is, by definition, larger than with 8-bit character formats because Unicode data uses 16 bits per character. Thus, documents or data files stored in the Unicode format are generally twice as large as would be the same documents or data files represented in ASCII, for example. There is therefore a need to reduce the increased size of Unicode files while still retaining the ability to represent the range of characters supported by the Unicode character format.
Typical compression methodologies handle uncompressed data on a byte-by-byte basis. Compressing data on a byte-by-byte basis generally works well for data which is comprised of characters stored in a single-byte-per-character format. Referring now to Figure 5, an example compression method which is well known in the art processes an uncompressed input data stream 10 to generate a compressed data output stream 20 by comparing an uncompressed portion 13 of input data stream 10 to data in a history buffer 11 of already processed input data. If a matching data string 12 is located in history buffer 11 for current data string 14, data string 14 is encoded in compressed data stream 20 as a pointer (p0, l0) 24, corresponding to an offset p0 15 and a data length l0 16. The shorter length data of pointer (p0, l0) 24 thus replaces longer data string 14 in output compressed data stream 20.
Unfortunately, such prior art compression approaches do not work as well with Unicode data because each character in the Unicode format is comprised of two bytes. One problem is the greater time needed to compress Unicode data. This is because Unicode data comprises suffix character data interspersed with prefix group indicators and hence, in general, more bytes have to be scanned in order to find each match.
Another problem is that the resulting length (l) and offset (p) values are generally double what the equivalent match would have produced in a non-Unicode data format. Doubling the value of either the length (l) or the offset (p) means more bits are needed to encode each pointer, which results in a decreased compression ratio, an undesirable side effect.
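The history-buffer matching described with reference to Figure 5, and the doubling of the offset and length values, can be illustrated with a small sketch. This is a simplified brute-force search written for this illustration, not the prior-art implementation itself; the function name find_longest_match is hypothetical, matches are restricted to the already processed data, and Python's "utf-16-be" codec stands in for the two-byte-per-character format with the prefix byte first.

```python
def find_longest_match(data: bytes, pos: int) -> tuple[int, int]:
    """Return (offset, length) of the longest match for data[pos:] in data[:pos].

    Returns (0, 0) when nothing in the history matches.
    """
    best_offset, best_length = 0, 0
    for candidate in range(pos):
        length = 0
        # Extend the match while it stays inside the history buffer
        # and the bytes keep agreeing.
        while (candidate + length < pos
               and pos + length < len(data)
               and data[candidate + length] == data[pos + length]):
            length += 1
        if length > best_length:
            best_offset, best_length = pos - candidate, length
    return best_offset, best_length


text = "abcabc"
single_byte = text.encode("ascii")
two_byte = text.encode("utf-16-be")

print(find_longest_match(single_byte, 3))  # (3, 3)
print(find_longest_match(two_byte, 6))     # (6, 6)
```

The same three-character repetition yields a pointer of (3, 3) in the single-byte form and (6, 6) in the two-byte form, so both values double.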
A still further problem is the increased difficulty in finding matching strings when the current data string to be matched occurs at a break between a prefix group indicator and its associated suffix character due to the previous matching string. In that situation, one is no longer merely trying to match a character and its associated prefix group code with an earlier character and its associated prefix group code. Instead one is trying to match a character and a following prefix group code with an earlier occurrence of the same character and the same following prefix group code. There is less likelihood of finding such a match and so this too results in a decreased compression ratio. Thus, an improved compression methodology is needed to handle the larger data files of the two-byte-per-character Unicode format.
SUMMARY AND OBJECTS OF THE INVENTION
An objective of the present invention is to provide an improved method and apparatus for efficient compression of data.
Another objective of the present invention is to provide an improved method and apparatus for efficient compression of data having redundant characteristics.
A still further objective of the present invention is to provide an improved method and apparatus for efficient compression of data stored in a Unicode character format.
The foregoing and other advantages are provided by a method for compressing Unicode data comprising separating the prefix group indicator portion of the Unicode data from the suffix character portion of the Unicode data and separately compressing the prefix group indicator portion of the Unicode data and the suffix character portion of the Unicode data.
The foregoing and other advantages are further provided by a compression method wherein the prefix group indicator portion of the Unicode data is separately compressed by different compression methodologies and then using the compression methodology which yielded the best compression ratio with the prefix group indicator portion of the Unicode data.
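A minimal sketch of this selection step follows. The specification does not fix which methodologies are tried or how the winner is recorded, so everything concrete here is an assumption: three Python standard-library compressors stand in for the "different compression methodologies", and the hypothetical function compress_with_best returns the name of the winning method alongside its output so that a decompressor could tell which one was used.

```python
import bz2
import lzma
import zlib

# Stand-in methodologies; the specification names run-length encoding and a
# master+exception list, which could be registered here in the same way.
METHODS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compress_with_best(prefix_bytes: bytes) -> tuple[str, bytes]:
    """Compress with every method and keep the smallest result."""
    results = {name: fn(prefix_bytes) for name, fn in METHODS.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]


method, payload = compress_with_best(b"\x00" * 1000 + b"\x22")
print(method, len(payload))
```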
The foregoing and other advantages are also provided by an apparatus for compressing Unicode data comprising processor means for separating the prefix group indicator portion of the Unicode data from the suffix character portion of the Unicode data, processor means for separately compressing the prefix group indicator portion of the Unicode data and the suffix character portion of the Unicode data, and memory means for storing the compressed prefix group indicator portion of the Unicode data and the compressed suffix character portion of the Unicode data.
One advantage of the present invention is improved compression/decompression of Unicode data. Another advantage of the present invention is improved compression/decompression of data stored in a two-byte-per-character format.
Still another advantage of the present invention is improved compression/decompression of data which is inherently redundant.
Other objects, features and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description which follows.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which:
Fig. 1 depicts the format of a single generic Unicode character;
Fig. 2 is a generalized block diagram of a typical computer system which might utilize the present invention;
Fig. 3 depicts a typical sequence of characters stored in the two-byte-per-character Unicode format;
Fig. 4 is a block diagram of the compression and decompression approach of the present invention; and
Fig. 5 depicts an example compression and decompression approach of the prior art.
DETAILED DESCRIPTION OF THE INVENTION
Figure 2 is a generalized block diagram of a typical computer system 210 which might utilize the present invention. Computer system 210 includes a CPU/memory unit 211 that generally comprises a microprocessor, related logic circuitry, and memory circuitry. Input device 213 provides input to the CPU/memory unit 211, which by way of example can be a keyboard, a mouse, a trackball, a joystick, a stylus, a touch screen, a touch tablet, etc., or any combination thereof. External storage 217, which can include fixed disk drives, floppy disk drives, memory cards, etc., is used for mass storage of programs and data. Display output is provided by display 219, which by way of example can be a video display or a liquid crystal display. Note that for some configurations of computer system 210, input device 213 and display 219 may be one and the same, e.g., display 219 may also be a tablet which can be pressed or written on for input purposes.
As has been explained, the Unicode standard stores data in a two-byte-per-character format. Because traditional compression methods do not properly handle data stored in multiple bytes per character, the increased file size of Unicode data competes with the benefits of using the Unicode format. The present invention overcomes the limitations of the traditional compression methods while still supporting the benefits gained through use of the Unicode standard.
Referring now to Figure 3, a typical sequence of two-byte characters stored in a Unicode format can be seen. Bytes 1, 3 and 5 of the character sequence are the prefix group indicator bytes while bytes 2, 4 and 6 are the suffix characters themselves (see discussion with reference to Figure 1).
It is important to note here that while the prefix group indicator bytes 1, 3 and 5 of Figure 3 could be different from each other, because most documents tend to use one language there tends to be a high degree of redundancy in the group indicator bytes within a single document or file stored in the Unicode format. In other words, because a single document tends to be written primarily in a single language (e.g., English, Japanese, etc.) with a smaller number of other types of characters (e.g., control, mathematics symbols, etc.) intermixed, there tends to be a high degree of redundancy across the group indicator bytes within a file. This inherent redundancy of Unicode data (where generally every other byte is a group indicator byte) can be greatly utilized to improve compression of such data, as will be explained more fully herein.
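The redundancy of the group indicator bytes can be checked on a small sample. The sketch below is illustrative only: the function name prefix_histogram and the sample sentence are made up for this example, and Python's "utf-16-be" codec is assumed to produce the prefix-first byte order described with reference to Figure 3.

```python
from collections import Counter


def prefix_histogram(unicode_bytes: bytes) -> Counter:
    """Count how often each prefix (group indicator) byte occurs."""
    # Bytes 1, 3, 5, ... of the sequence sit at even zero-based indices.
    return Counter(unicode_bytes[0::2])


sample = "Price: 5 \u00d7 3 \u2264 20".encode("utf-16-be")
histogram = prefix_histogram(sample)
for prefix, count in histogram.most_common():
    print(f"prefix 0x{prefix:02X}: {count} characters")
```

In this 17-character sample, 16 characters share the prefix 0x00 and only the "less than or equal" sign (U+2264) carries a different prefix, 0x22.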
Referring now to Figure 4, the compression methodology of the preferred embodiment of the present invention will now be explained. Uncompressed Unicode data 401 is first separated into two blocks or files: one block or file containing the inherently redundant prefix group indicator bytes 405 without the suffix character indicating bytes, and one block or file containing the suffix character indicating bytes 407 without the prefix group indicator bytes. This is a simple process in the preferred embodiment since these are merely alternating bytes within the incoming uncompressed Unicode data.
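Because the two kinds of bytes simply alternate, the separation step can be sketched with two strided slices. This is a minimal illustration under the assumption that the input is a whole number of two-byte characters with the prefix byte first; the function name split_streams is hypothetical.

```python
def split_streams(unicode_bytes: bytes) -> tuple[bytes, bytes]:
    """Separate two-byte characters into (prefix_bytes, suffix_bytes)."""
    if len(unicode_bytes) % 2:
        raise ValueError("expected an even number of bytes")
    prefixes = unicode_bytes[0::2]  # group indicator bytes (405 in Figure 4)
    suffixes = unicode_bytes[1::2]  # character bytes (407 in Figure 4)
    return prefixes, suffixes


prefixes, suffixes = split_streams("Hi".encode("utf-16-be"))
print(prefixes, suffixes)
```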
The file containing the suffix character bytes 407 can be compressed 411 using any typical data compression method and results in compressed character file or block 415.
In the preferred embodiment of the present invention, the file containing the inherently redundant prefix group indicator bytes 405 is compressed using one of a number of compression methodologies and results in compressed prefix file or block 413. One compression technique used in the method of the present invention is a run-length encoding methodology wherein sequences of repeated bytes are replaced with a count of the number of instances of that byte being repeated in the prefix group indicator portion of the Unicode data. A second compression technique used in the method of the present invention is a master+exception list methodology wherein one byte of the prefix group indicator portion of the Unicode data is chosen as a master byte and each exception to the master byte in the prefix group indicator portion of the Unicode data is noted.
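Both techniques can be sketched briefly. The specification does not give a byte layout for either one, so the details below are assumptions made for illustration: the run-length encoder emits (count, byte) pairs with the count capped at 255, and the master+exception encoder picks the most frequent byte as the master and records each exception as an (index, byte) pair. The function names are hypothetical.

```python
from collections import Counter


def rle_encode(data: bytes) -> bytes:
    """Run-length encode as (count, byte) pairs, count limited to 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def master_exception_encode(data: bytes) -> tuple[int, list[tuple[int, int]]]:
    """Return (master_byte, [(index, byte), ...]) for bytes differing from the master."""
    if not data:
        return 0, []
    master = Counter(data).most_common(1)[0][0]
    exceptions = [(i, b) for i, b in enumerate(data) if b != master]
    return master, exceptions


prefix_bytes = b"\x00" * 5 + b"\x22" + b"\x00" * 2
print(rle_encode(prefix_bytes).hex())
print(master_exception_encode(prefix_bytes))
```

Under these assumptions, run-length encoding suits prefix data with long unbroken runs, while the master+exception list stays compact when occasional characters from other groups are scattered through text that otherwise uses a single group.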
Note that compressed prefix file 413 and compressed character file 415 are stored as a single combined compressed Unicode file in the preferred embodiment of the present invention. Note that decompression is provided by merely reversing the compression process, as is indicated in Fig. 4.
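A round trip through compression and decompression can be sketched as follows. The layout of the combined file is not given in the specification, so the 4-byte length header used here is purely an assumption, as are the function names; zlib from the Python standard library stands in for whichever compression method is applied to each block.

```python
import struct
import zlib


def compress_unicode(data: bytes) -> bytes:
    """Split, compress each block separately, and store both in one blob."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    prefix_block = zlib.compress(data[0::2])
    suffix_block = zlib.compress(data[1::2])
    # Assumed layout: 4-byte big-endian prefix block length, then both blocks.
    return struct.pack(">I", len(prefix_block)) + prefix_block + suffix_block


def decompress_unicode(blob: bytes) -> bytes:
    """Reverse compress_unicode: decompress both blocks and re-interleave."""
    (prefix_len,) = struct.unpack_from(">I", blob, 0)
    prefixes = zlib.decompress(blob[4:4 + prefix_len])
    suffixes = zlib.decompress(blob[4 + prefix_len:])
    out = bytearray(len(prefixes) * 2)
    out[0::2] = prefixes  # prefix bytes return to positions 1, 3, 5, ...
    out[1::2] = suffixes  # suffix bytes return to positions 2, 4, 6, ...
    return bytes(out)


original = "Unicode \u03a9 test".encode("utf-16-be")
assert decompress_unicode(compress_unicode(original)) == original
```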
In the foregoing specification, the invention has been described with reference to a specific exemplary embodiment and alternative embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims

What is claimed is:
1. A method for compressing Unicode data comprising:
a) separating the prefix group indicator portion of the Unicode data from the suffix character portion of the Unicode data; and
b) separately compressing the prefix group indicator portion of the Unicode data and the suffix character portion of the Unicode data.
2. The compression method of Claim 1 wherein the prefix group indicator portion of the Unicode data is compressed using a run-length encoding methodology wherein sequences of repeated bytes are replaced with a count of the number of instances of that byte being repeated in the prefix group indicator portion of the Unicode data.
3. The compression method of Claim 1 wherein the prefix group indicator portion of the Unicode data is compressed using a master+exception list methodology wherein one byte of the prefix group indicator portion of the Unicode data is chosen as a master byte and each exception to the master byte in the prefix group indicator portion of the Unicode data is noted.
4. The compression method of Claim 1 wherein the prefix group indicator portion of the Unicode data is compressed by:
i) separately compressing the prefix group indicator portion of the Unicode data by different compression methodologies; and
ii) using the compression methodology of the different compression methodologies in (i) which yielded the best compression ratio with the compressed prefix group indicator portion of the Unicode data.
5. A method for decompressing compressed Unicode data comprising:
a) separately decompressing the compressed prefix group indicator portion of the Unicode data and the compressed suffix character portion of the Unicode data; and
b) joining the decompressed prefix group indicator portion of the Unicode data with the decompressed suffix character portion of the Unicode data.
6. An apparatus for compressing Unicode data comprising:
a) processor means for separating the prefix group indicator portion of the Unicode data from the suffix character portion of the Unicode data;
b) processor means for separately compressing the prefix group indicator portion of the Unicode data and the suffix character portion of the Unicode data; and
c) memory means for storing the compressed prefix group indicator portion of the Unicode data and the compressed suffix character portion of the Unicode data.
7. An apparatus for decompressing Unicode data comprising:
a) processor means for separately decompressing the compressed prefix group indicator portion of the Unicode data and the compressed suffix character portion of the Unicode data;
b) processor means for joining the uncompressed prefix group indicator portion of the Unicode data with the uncompressed suffix character portion of the Unicode data; and
c) memory means for storing the uncompressed Unicode data.
PCT/US1994/005320 1993-05-13 1994-05-13 Method and apparatus for efficient compression of data having redundant characteristics WO1994027374A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU69118/94A AU6911894A (en) 1993-05-13 1994-05-13 Method and apparatus for efficient compression of data having redundant characteristics

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US6164893A 1993-05-13 1993-05-13
US08/061,648 1993-05-13

Publications (1)

Publication Number Publication Date
WO1994027374A1 1994-11-24

Family

ID=22037174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1994/005320 WO1994027374A1 (en) 1993-05-13 1994-05-13 Method and apparatus for efficient compression of data having redundant characteristics

Country Status (2)

Country Link
AU (1) AU6911894A (en)
WO (1) WO1994027374A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4866440A (en) * 1986-12-12 1989-09-12 Hitachi, Ltd. Method for compressing and restoring data series and apparatus for realizing same

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0789460A1 (en) * 1996-02-09 1997-08-13 Fujitsu Limited Data compression/decompression apparatus and method
US5889481A (en) * 1996-02-09 1999-03-30 Fujitsu Limited Character compression and decompression device capable of handling a plurality of different languages in a single text
GB2360915A (en) * 2000-03-30 2001-10-03 Sony Uk Ltd Run length compression encoding of selected bits of data words
GB2360915B (en) * 2000-03-30 2004-01-28 Sony Uk Ltd Data compression
EP2164176A1 (en) * 2008-09-12 2010-03-17 Thomson Licensing Method for lossless compressing prefix-suffix-codes, method for decompressing a bit sequence representing integers or symbols encoded in compressed prefix-suffix-codes and storage medium or signal carrying compressed prefix-suffix-codes
WO2010028967A1 (en) * 2008-09-12 2010-03-18 Thomson Licensing Method for lossless compressing prefix-suffix-codes, method for decompressing a bit sequence representing integers or symbols encoded in compressed prefix-suffix-codes and storage medium or signal carrying compressed prefix-suffix-codes
CN102760119A (en) * 2012-07-11 2012-10-31 北京理工大学 Method for storing Unicode coded character string in embedded device
CN111611214A (en) * 2020-05-25 2020-09-01 广州翔声智能科技有限公司 Big data storage algorithm
CN111611214B (en) * 2020-05-25 2023-08-18 广州翔声智能科技有限公司 Big data storage method

Also Published As

Publication number Publication date
AU6911894A (en) 1994-12-12

Similar Documents

Publication Publication Date Title
US5444445A (en) Master + exception list method and apparatus for efficient compression of data having redundant characteristics
JP3009727B2 (en) Improved data compression device
EP0584992B1 (en) Text compression technique using frequency ordered array of word number mappers
JP3499671B2 (en) Data compression device and data decompression device
US7817069B2 (en) Alternative encoding for LZSS output
TW312771B (en)
EP1040618B1 (en) Method and apparatus for simultaneously encrypting and compressing data
US7305541B2 (en) Compression of program instructions using advanced sequential correlation
Severance A practitioner's guide to data base compression tutorial
US6094634A (en) Data compressing apparatus, data decompressing apparatus, data compressing method, data decompressing method, and program recording medium
JPH08330974A (en) Data compression method and circuit for restoring compressed code
US6737994B2 (en) Binary-ordered compression for unicode
Abel et al. Universal text preprocessing for data compression
EP0703674A2 (en) Method and apparatus for numeric-to-string conversion
US5502439A (en) Method for compression of binary data
EP0450049B1 (en) Character encoding
US6834283B1 (en) Data compression/decompression apparatus using additional code and method thereof
WO1994027374A1 (en) Method and apparatus for efficient compression of data having redundant characteristics
US5710919A (en) Record compression
JP2008192163A (en) Unicode converter
Jrai et al. Improving LZW Compression of Unicode Arabic Text Using Multi-Level Encoding and a Variable-Length Phrase Code
US6731229B2 (en) Method to reduce storage requirements when storing semi-redundant information in a database
JPH0546357A (en) Compressing method and restoring method for text data
JPH07182354A (en) Method for generating electronic document
Felician et al. A nearly optimal Huffman technique in the microcomputer environment

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AT AU BB BG BR BY CA CH CN CZ DE DK ES FI GB GE HU JP KG KP KR KZ LK LU LV MD MG MN MW NL NO NZ PL PT RO RU SD SE SI SK TJ TT UA UZ VN

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH DE DK ES FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101)
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 EP: PCT application non-entry in European phase
NENP Non-entry into the national phase

Ref country code: CA