WO2002075929A1 - A method for compressing data - Google Patents
A method for compressing data
- Publication number
- WO2002075929A1 WO2002075929A1 PCT/FI2002/000194 FI0200194W WO02075929A1 WO 2002075929 A1 WO2002075929 A1 WO 2002075929A1 FI 0200194 W FI0200194 W FI 0200194W WO 02075929 A1 WO02075929 A1 WO 02075929A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- code length
- tree
- nodes
- probability distribution
- Prior art date
Links
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
- H03M7/4006—Conversion to or from arithmetic code
Definitions
- the present invention relates to a method for compressing data according to the preamble of Claim 1.
- the invention also relates to a device according to the preamble of Claim 9.
- the purpose of the present invention is to provide an improved method and a device to losslessly compress short data.
- the invention is based on the idea that an algorithm is used to examine the nodes of the Markov machine to determine which leaves of the nodes can be eliminated. This algorithm then produces a simpler machine for compressing data. This compression machine is also called a tree machine in this description. More precisely, the method according to the invention is characterized by what is presented in the characterising part of Claim 1. The device according to the invention is characterized by what is presented in the characterising part of Claim 9.
- TM Tree Machines
- an arithmetic code is designed for encoding each short file symbol for symbol using predesigned (non-adaptive) addend (codeword) tables at the so-called encoding nodes.
- An encoding node is the deepest node in the tree where the symbol occurring in the training data has a positive count. Symbols that do not occur will be encoded with a suitably selected escape mechanism, described later in this disclosure.
- the invention gives several advantages compared to prior art.
- the compression of especially short texts is performed much more efficiently than with prior art methods.
- This improved compression efficiency reduces the memory space required for storing the compressed data, the transmission capacity it demands, and the time needed for transferring it.
- Fig. 1a is a schematic view showing the steps for compressing a short data file according to a method of an advantageous embodiment of the present invention,
- Fig. 1b is a simplified block diagram of an electronic device in which the method of an advantageous embodiment of the present invention can be applied,
- Fig. 2 describes the construction of a maximal tree machine from the training data and the construction of a sub tree machine by pruning the maximal tree with an algorithm according to an advantageous embodiment of the present invention,
- Fig. 3 describes the compression of a data block using an arithmetic coder or a Huffman code and the probabilities from the pruned TM,
- Fig. 4 describes the decompression of the compressed data block,
- Fig. 5 shows simulation results of compression with an advantageous method of the invention and with some prior art methods,
- Fig. 6 illustrates an example of a TM built from example training data,
- Fig. 7 illustrates testing of some nodes of the context tree presented in Figure 6 for pruning,
- Fig. 8 illustrates an example of pruning a node,
- Fig. 9 illustrates an example of checking a node for pruning,
- Fig. 10 shows an example of a context tree,
- Fig. 11 shows the recursive implementation of an advantageous algorithm of the method according to the invention and an example of a context tree, and
- Fig. 12 shows a block diagram of the implementation of another advantageous algorithm of the method according to the invention.
- Let a_s be the set of distinct symbols immediately following the various occurrences of s, and let the symbol i occur n_i(s) times in that position.
- The TM stores the counts n_i(s) at each node s.
- FIG. 6 illustrates a TM built from training data "VESIHIISI SIHISI HISSISSA".
- the encoding process will be described later, but first the so-called ideal code length, or empirical entropy, that each TM assigns to the training string is described.
- the ideal code length of symbol i that occurs at node s is defined as log(n_s / n_i(s)),
- Let s(t) denote the deepest node in the tree, call it a leaf node, where symbol x_t of the training data occurs. This node is found by climbing the tree, reading the symbols of the past string x^(t-1) from right to left.
- the sum of the ideal code lengths of all the symbols x_t of the training data is the ideal code length assigned to the training string x^n by the Tree Machine W, or
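The counting and ideal-code-length computation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the dictionary-based tree, and the fixed maximum context depth of 3 are assumptions made for the example.

```python
from math import log2

def build_counts(text, depth):
    """For each context (a suffix of the past, up to `depth` symbols long),
    count the symbols that immediately follow it -- the n_i(s) counts."""
    counts = {}  # context string -> {symbol: count}
    for t, sym in enumerate(text):
        for d in range(min(depth, t) + 1):
            ctx = text[t - d:t]
            counts.setdefault(ctx, {})
            counts[ctx][sym] = counts[ctx].get(sym, 0) + 1
    return counts

def ideal_code_length(text, counts, depth):
    """Sum over the string of log2(n_s / n_i(s)), each symbol being coded
    at the deepest node of the tree where it occurs."""
    total = 0.0
    for t, sym in enumerate(text):
        for d in range(min(depth, t), -1, -1):  # climb from the deepest context
            ctx = text[t - d:t]
            if ctx in counts and sym in counts[ctx]:
                node = counts[ctx]
                break
        total += log2(sum(node.values()) / node[sym])
    return total

data = "VESIHIISI SIHISI HISSISSA"
counts = build_counts(data, depth=3)
```

On the training string itself every symbol is found at its deepest available context, so the result is the empirical entropy the TM assigns to the training data.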
- a Tree Machine is used to encode strings much as a finite state machine:
- the nodes s could be taken as the deepest nodes in the tree along the path y_t, y_{t-1}, ...
- Not all symbols occur at every node, which creates the problem of how to encode symbols for which the counts n_i(s) are zero.
- an arithmetic code or a Huffman code can be designed at each deepest coding node.
- the size of the resulting tree depends on the value of the pruning parameter γ.
- If γ = 0, the pruned tree will compress at least as well as the unpruned tree, but is usually smaller.
- If γ > 0, the pruned tree will be smaller than needed for the maximal compression, but is still the best one among all trees of its size.
- If γ is large enough, the pruned tree size will be 1, consisting of only the root node.
- q_s stands for the total number of different symbols seen in the node s,
- n_i(s) is the count of symbol i in the node s,
- the sum of the counts in the node s is denoted as n_s.
- new code lengths l_i are calculated using the same formula, with the counts of the child node and the probabilities of the parent node "HI".
- the pruning procedure is continued recursively until the root node is reached, which cannot be pruned (since it has no parent node and cannot be compared to anything).
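The pruning test walked through above, recomputing code lengths for a child's symbols under the parent's probabilities, can be sketched as below. The helper names and the threshold symbol `gamma` are assumptions (the extraction garbled the patent's actual parameter symbol); the nodes are represented simply as symbol-count dictionaries.

```python
from math import log2

def own_cost(child):
    """Ideal code length of the child's symbols under its own counts."""
    n = sum(child.values())
    return sum(c * log2(n / c) for c in child.values())

def parent_cost(child, parent):
    """Code length of the same symbols coded with the parent's probabilities."""
    n = sum(parent.values())
    return sum(c * log2(n / parent[sym]) for sym, c in child.items())

def should_prune(child, parent, gamma=0.0):
    """Prune the leaf when keeping it saves at most `gamma` bits."""
    return parent_cost(child, parent) - own_cost(child) <= gamma
```

With gamma = 0, a leaf whose distribution matches its parent's is removed at no cost, which matches the statement that the pruned tree then compresses at least as well as the unpruned one.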
- Figure 10 shows an example of such a context tree.
- a message "VIISI" is compressed using these code words.
- For each symbol, the longest match of the preceding symbols that exists in the model tree is searched for.
- the context of the symbol S is the longest of the contexts "VII", "II", "I" and ε (the empty context) that exists in the model tree.
- the match found is empty (ε).
- the found matches are ε, "I", "II", "IIS" and "SI", respectively.
- the codeword for V in the context ε is 010.
- the codeword for I in the context ε is 10.
- the codeword for I in the context "I" is 110, and so on.
- the compressed string is 010 10 110 0 0 1111.
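The longest-context lookup used in this encoding step can be sketched as follows. The code table below is a hypothetical prefix-free stand-in, not the codewords of the Fig. 10 tree in the text; only the lookup logic mirrors the description.

```python
def encoding_context(tree, past, sym):
    """Longest suffix of `past` that is a node of the model tree and has a
    codeword for `sym` (the 'longest match' search in the text)."""
    for d in range(len(past), -1, -1):
        ctx = past[len(past) - d:]
        if ctx in tree and sym in tree[ctx]:
            return ctx
    raise KeyError(f"no codeword for {sym!r}")

# Hypothetical per-context prefix-free code table: context -> {symbol: codeword}
TREE = {
    "":   {"V": "00", "I": "1", "S": "01"},
    "I":  {"I": "0", "S": "1"},
    "II": {"S": "0"},
}

def encode(message, tree):
    return " ".join(tree[encoding_context(tree, message[:t], sym)][sym]
                    for t, sym in enumerate(message))
```

With this table, "VIISI" encodes symbol by symbol at the contexts ε, ε, "I", "II" and ε, just as the text traces for its own table.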
- Decompression is done similarly.
- the model to be used is the same as in the compression, so the Huffman code table is already known.
- For the first symbol it is known that its context is empty.
- the first codeword (010) is read and the symbol V is the result.
- For the next symbol the context is the longest of the contexts "V" and ε that exists in the model tree.
- the next codeword is now read and the result is the symbol I.
- the context of the next symbol is the longest of the contexts "VI", "I" and ε that exists in the model tree. Again, the next codeword is read, the result is again the symbol I, and so on.
- the compressed string 010 10 110 0 0 1111 translates to the original message "VIISI".
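Decompression can be sketched the same way: the decoder knows each symbol's context from what it has already decoded, so it selects the same per-context code table as the encoder. The table below is a hypothetical prefix-free stand-in (not the patent's Fig. 10 codewords), and the sketch assumes the deepest matching node always holds a codeword for the next symbol; the escape mechanism for unseen symbols is omitted.

```python
# Hypothetical per-context prefix-free code table: context -> {symbol: codeword}
TREE = {
    "":   {"V": "00", "I": "1", "S": "01"},
    "I":  {"I": "0", "S": "1"},
    "II": {"S": "0"},
}

def deepest_context(tree, past):
    """Longest suffix of `past` that is a node of the model tree."""
    for d in range(len(past), -1, -1):
        ctx = past[len(past) - d:]
        if ctx in tree:
            return ctx
    return ""

def decode(bits, tree, n_symbols):
    """Match codewords greedily: each per-context code is prefix-free, so
    extending the current codeword bit by bit finds a unique symbol."""
    out, pos = "", 0
    for _ in range(n_symbols):
        inverse = {code: s for s, code in tree[deepest_context(tree, out)].items()}
        code = ""
        while code not in inverse:
            code += bits[pos]
            pos += 1
        out += inverse[code]
    return out
```

Because the per-context codes are prefix-free, no separators are needed between codewords in the compressed bit string.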
- the model tree is converted into a state machine.
- Each node (or context) of the tree becomes a state.
- Each symbol to be compressed causes a state transition to the state for the context of the next symbol (which is known by the symbols read this far).
- the tree (which now is a state machine) need not be searched every time a symbol is read, and finding the context becomes an O(1) operation.
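The conversion described above can be sketched as a precomputed transition table. The function below is an illustrative assumption, not the patent's construction; `tree` is any collection of context strings supporting membership tests.

```python
def to_state_machine(tree, alphabet):
    """Each tree node (context) becomes a state.  For every (state, symbol)
    pair, precompute the next state: the longest suffix of state+symbol
    that is itself a node.  After this, finding the coding context during
    compression or decompression is a single table lookup, i.e. O(1)."""
    def longest_suffix(s):
        for i in range(len(s) + 1):
            if s[i:] in tree:
                return s[i:]
        return ""
    return {(ctx, sym): longest_suffix(ctx + sym)
            for ctx in tree for sym in alphabet}
```

The table trades memory for speed: its size is the number of nodes times the alphabet size, but each symbol read during coding costs one dictionary lookup.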
- CORESSION compresses small files much better than the other five algorithms.
- PKZIP: 800 kilobytes per second.
- ACE: 100 kilobytes per second.
- BZIP2: 200 kilobytes per second.
- PPMZ: 20 kilobytes per second.
- CORESSION: 400 kilobytes per second with arithmetic coding, or 900 kilobytes per second with Huffman codes.
- the data structures used are fixed. That is, the data structures do not need to be modified during compression or decompression, as opposed to any adaptive compression program.
- the method of the present invention can be made at least as fast as any adaptive algorithm whose context structures, count structures and coding scheme are at least as complex as those of the present method.
- COMPRESS: a few hundred kilobytes.
- PKZIP: a few hundred kilobytes.
- ACE: up to 36 megabytes, typically several megabytes.
- BZIP2: two to six times the size of the file to be compressed.
- In Fig. 1b there is shown a simplified block diagram of an electronic device 18 in which the method of an advantageous embodiment of the present invention can be applied.
- the electronic device comprises at least a first input/output block 19 for inputting the training data from e.g. a database 23.
- the controlling unit 20 is provided for controlling the electronic device 18 and for performing the steps of the method of the present invention.
- the controlling unit 20 may comprise one or more processors, such as a microprocessor and/or a digital signal processor.
- the memory means 21 are provided for storing necessary program codes for the operation of the controlling unit, for storing temporary data, for storing the data structures, for storing the data to be compressed, etc.
- the data to be compressed can be read e.g.
- the compressed data can be saved into the memory means 21, into the database 23, and/or it can be transmitted e.g. to a data transmission channel 24 for transmission to a receiving device (not shown) where the compressed data can be decompressed.
- the receiving device can also be similar to the electronic device presented in Fig. 1b.
- the decompression can be performed by using similar data structures (Tree Machine, Finite State Machine) to those used in the compression.
- the pruned tree machine can be transformed to a finite state machine, which is known as such.
- the compression can normally be performed faster by the finite state machine than by the pruned tree machine.
- the present invention can be applied in many applications. For example, in mobile telecommunication environment short messages can be compressed for reducing needed transmission capacity before sending them in a mobile network.
- the present invention can also be applied in computers for compressing text files which then can be saved into storage medium.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FI20010525A FI110374B (fi) | 2001-03-16 | 2001-03-16 | Menetelmä tiedon pakkaamiseksi |
FI20010525 | 2001-03-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2002075929A1 true WO2002075929A1 (en) | 2002-09-26 |
Family
ID=8560753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/FI2002/000194 WO2002075929A1 (en) | 2001-03-16 | 2002-03-13 | A method for compressing data |
Country Status (2)
Country | Link |
---|---|
FI (1) | FI110374B (fi) |
WO (1) | WO2002075929A1 (fi) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7028042B2 (en) * | 2002-05-03 | 2006-04-11 | Jorma Rissanen | Lossless data compression system |
CN113609344A (zh) * | 2021-09-29 | 2021-11-05 | 北京泰迪熊移动科技有限公司 | 字节流状态机的构建方法及装置、电子设备、存储介质 |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1995014350A1 (en) * | 1993-11-15 | 1995-05-26 | National Semiconductor Corporation | Quadtree-structured walsh transform coding |
US5534861A (en) * | 1993-04-16 | 1996-07-09 | International Business Machines Corporation | Method and system for adaptively building a static Ziv-Lempel dictionary for database compression |
WO1997036376A1 (en) * | 1996-03-28 | 1997-10-02 | Vxtreme, Inc. | Table-based compression with embedded coding |
-
2001
- 2001-03-16 FI FI20010525A patent/FI110374B/fi not_active IP Right Cessation
-
2002
- 2002-03-13 WO PCT/FI2002/000194 patent/WO2002075929A1/en not_active Application Discontinuation
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5534861A (en) * | 1993-04-16 | 1996-07-09 | International Business Machines Corporation | Method and system for adaptively building a static Ziv-Lempel dictionary for database compression |
WO1995014350A1 (en) * | 1993-11-15 | 1995-05-26 | National Semiconductor Corporation | Quadtree-structured walsh transform coding |
WO1997036376A1 (en) * | 1996-03-28 | 1997-10-02 | Vxtreme, Inc. | Table-based compression with embedded coding |
Non-Patent Citations (2)
Title |
---|
GINESTA XAVIER ET AL.: "Vector quantization of contextual information for lossless image compression", DATA COMPRESSION CONFERENCE, 1995. DCC'94, 29 March 1994 (1994-03-29) - 31 March 1994 (1994-03-31), pages 390 - 399 * |
SAUPE DIETMAR ET AL.: "Optimal hierarchical partitions for fractal image compression", INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, 1998. ICIP, PROCEEDINGS, vol. 1, 4 October 1998 (1998-10-04) - 7 October 1998 (1998-10-07), pages 737 - 741 * |
Also Published As
Publication number | Publication date |
---|---|
FI20010525A (fi) | 2002-09-17 |
FI110374B (fi) | 2002-12-31 |
FI20010525A0 (fi) | 2001-03-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fiala et al. | Data compression with finite windows | |
CA2321233C (en) | Block-wise adaptive statistical data compressor | |
EP0695040B1 (en) | Data compressing method and data decompressing method | |
US5532694A (en) | Data compression apparatus and method using matching string searching and Huffman encoding | |
US5270712A (en) | Sort order preserving method for data storage compression | |
Nelson et al. | The data compression book 2nd edition | |
US5841376A (en) | Data compression and decompression scheme using a search tree in which each entry is stored with an infinite-length character string | |
US5229768A (en) | Adaptive data compression system | |
US6100824A (en) | System and method for data compression | |
JP3258552B2 (ja) | データ圧縮装置及びデータ復元装置 | |
US7764202B2 (en) | Lossless data compression with separated index values and literal values in output stream | |
JPH09162748A (ja) | データ符号化方法、データ復号方法、データ圧縮装置、データ復元装置、及びデータ圧縮・復元システム | |
JPH09246991A (ja) | データ圧縮・復元方法及びデータ圧縮装置及びデータ復元装置 | |
US6304676B1 (en) | Apparatus and method for successively refined competitive compression with redundant decompression | |
US5594435A (en) | Permutation-based data compression | |
JPS6356726B2 (fi) | ||
US20030212695A1 (en) | Lossless data compression system | |
EP1266455A1 (en) | Method and apparatus for optimized lossless compression using a plurality of coders | |
Chen et al. | Efficient Lossless Compression of Trees and Graphs | |
Rathore et al. | A brief study of data compression algorithms | |
KR100494876B1 (ko) | 2바이트 문자 데이터 압축 방법 | |
WO2002075929A1 (en) | A method for compressing data | |
Ghuge | Map and Trie based Compression Algorithm for Data Transmission | |
Yokoo | An adaptive data compression method based on context sorting | |
Senthil et al. | Text compression algorithms: A comparative study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ CZ DE DE DK DK DM DZ EC EE EE ES FI FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
122 | Ep: pct application non-entry in european phase | ||
NENP | Non-entry into the national phase |
Ref country code: JP |
|
WWW | Wipo information: withdrawn in national office |
Country of ref document: JP |