WO2002075929A1 - A method for compressing data - Google Patents

A method for compressing data

Info

Publication number
WO2002075929A1
WO2002075929A1 (PCT/FI2002/000194)
Authority
WO
WIPO (PCT)
Prior art keywords
node
code length
tree
nodes
probability distribution
Prior art date
Application number
PCT/FI2002/000194
Other languages
English (en)
French (fr)
Inventor
Petri Kuukkanen
Jukka Saarinen
Original Assignee
Coression Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Coression Oy filed Critical Coression Oy
Publication of WO2002075929A1 publication Critical patent/WO2002075929A1/en

Links

Classifications

    • H - ELECTRICITY
    • H03 - ELECTRONIC CIRCUITRY
    • H03M - CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 - Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 - Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H03M7/4006 - Conversion to or from arithmetic code

Definitions

  • the present invention relates to a method for compressing data according to the preamble of Claim 1.
  • the invention also relates to a device according to the preamble of Claim 9.
  • the purpose of the present invention is to provide an improved method and a device to losslessly compress short data.
  • the invention is based on the idea that an algorithm is used to examine the nodes of the Markov machine to determine the leaves of the nodes that can be eliminated. This algorithm then produces a simpler machine for compressing data, also called a tree machine in this description. More precisely, the method according to the invention is characterized by what will be presented in the characterising part of Claim 1. The device according to the invention is characterized by what will be presented in the characterising part of Claim 9.
  • TM: Tree Machine
  • an arithmetic code is designed for encoding each short file symbol by symbol, using predesigned (non-adaptive) addend (codeword) tables at the so-called encoding nodes.
  • An encoding node is the deepest node in the tree where the symbol occurring in the training data has positive count. The symbols that do not occur will be encoded with a suitably selected escape mechanism to be described later in this disclosure.
  • the invention gives several advantages related to prior art.
  • the compression of especially short texts is performed much more efficiently than with prior art methods.
  • This improved compression efficiency reduces the amount of memory space required for storing such compressed data, information transferring demands, and the time needed for transferring the data.
  • Fig. 1a is a schematic view showing the steps for compressing a short data file according to a method of an advantageous embodiment of the present invention
  • Fig. 1b is a simplified block diagram of an electronic device in which the method of an advantageous embodiment of the present invention can be applied,
  • Fig. 2 describes the construction of a maximal tree machine from the training data and the construction of a sub tree machine by pruning the maximal tree with an algorithm according to an advantageous embodiment of the present invention
  • Fig. 3 describes the compression of a data block using an arithmetic coder or a Huffman code and the probabilities from the pruned TM,
  • Fig. 4 describes the decompression of the compressed data block
  • Fig. 5 shows simulation results of compression with an advantageous method of the invention and with some prior art methods
  • Fig. 6 illustrates an example of TM built from an example training data
  • Fig. 7 illustrates testing of some nodes of the context tree presented in Figure 6 for pruning
  • Fig. 8 illustrates an example of pruning a node
  • Fig. 9 illustrates an example of checking a node for pruning
  • Fig. 10 shows an example of a context tree
  • Fig. 11 shows the recursive implementation of an advantageous algorithm of the method according to the invention and an example of a context tree
  • Fig. 12 shows a block diagram of the implementation of another advantageous algorithm of the method according to the invention.
  • let A_s be the set of distinct symbols immediately following the various occurrences of s, and let the symbol i occur n_i,s times in that set;
  • the TM then has the counts n_i,s at its nodes.
  • FIG. 6 illustrates a TM built from training data "VESIHIISI SIHISI HISSISSA".
  • the encoding process will be described later, but first the so-called ideal code length or the empirical entropy each TM assigns to the training string is described.
  • the ideal code length of symbol i that occurs at node s is defined as log(n_s / n_i,s),
  • let s(t) denote the deepest node in the tree, call it a leaf node, where symbol x_t of the training data occurs. This node is defined by climbing the tree by reading the symbols of the past string x_t-1, x_t-2, ... from right to left.
  • the sum of the ideal code lengths of all the symbols x_t of the training data is the ideal code length assigned to the training string x^n by the Tree Machine, i.e. L(x^n) = sum over t of log(n_s(t) / n_x_t,s(t)).
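As a concrete illustration, the ideal code length of a string under a given context tree can be computed as follows. This is a minimal sketch, not the patented implementation; the tree representation (a dict mapping each context string to its symbol counts) and the toy counts are assumptions made purely for illustration.

```python
import math

def deepest_node(tree, past, symbol):
    """Climb the tree by reading the past string from right to left:
    return the longest suffix of `past` that is a node in which
    `symbol` has a positive count (the encoding node s(t))."""
    for k in range(len(past), -1, -1):
        ctx = past[len(past) - k:]
        if ctx in tree and tree[ctx].get(symbol, 0) > 0:
            return ctx
    return None  # would be handled by the escape mechanism

def ideal_code_length(tree, data):
    """Sum of log2(n_s / n_{i,s}) over all symbols of the data."""
    total = 0.0
    for t, sym in enumerate(data):
        ctx = deepest_node(tree, data[:t], sym)
        counts = tree[ctx]
        total += math.log2(sum(counts.values()) / counts[sym])
    return total

# hypothetical toy tree: context string -> symbol counts
tree = {"": {"a": 2, "b": 2}, "a": {"b": 2}}
print(ideal_code_length(tree, "ab"))  # -> 1.0
```

Here 'a' in the empty context costs log2(4/2) = 1 bit, while 'b' after 'a' is certain in this toy tree and costs 0 bits.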
  • a Tree Machine is used to encode strings much as a finite state machine:
  • the nodes s could be taken as the deepest nodes in the tree along the path y_t, y_t-1, ...
  • not all symbols occur at every node, which creates the problem of how to encode symbols for which the counts n_i,s are zero.
  • an arithmetic code or a Huffman code at each deepest coding node can be designed.
  • the size of the resulting tree depends on the value of the pruning parameter γ.
  • with γ = 0, the pruned tree will compress at least as well as the unpruned tree, but is usually smaller.
  • with γ > 0, the pruned tree will be smaller than needed for the maximal compression, but is still the best one among all trees of its size.
  • with a sufficiently large γ, the pruned tree size will be 1, consisting only of the root node.
  • q_s stands for the total number of different symbols seen in the node s,
  • n_i,s is the count of symbol i in the node s,
  • the sum of the counts in the node s is denoted as n_s.
  • new code lengths l_i are calculated using the same formula with the counts of the child and the probabilities of the parent node, "HI":
  • the pruning procedure is continued recursively until the root node is reached, which cannot be pruned (since it has no parent node and cannot be compared to anything).
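The pruning test can be sketched as follows. This is a simplified reading of the procedure described above, under the assumption that a leaf is pruned when coding its counts with the parent's probabilities costs at most γ extra bits relative to its own code lengths; the function names, the tree representation, and the exact criterion are illustrative assumptions, not the claimed algorithm.

```python
import math

def extra_bits_if_pruned(child_counts, parent_counts):
    """Ideal code length of the child's symbols under the parent's
    probabilities, minus the child's own ideal code length."""
    n_c = sum(child_counts.values())
    n_p = sum(parent_counts.values())
    own = sum(c * math.log2(n_c / c) for c in child_counts.values())
    via_parent = sum(c * math.log2(n_p / parent_counts[s])
                     for s, c in child_counts.items())
    return via_parent - own

def prune(tree, children, node, gamma):
    """Recursively prune leaves whose removal costs at most gamma bits,
    working bottom-up; the root itself is never pruned."""
    for child in list(children.get(node, [])):
        prune(tree, children, child, gamma)
        if not children.get(child) and \
           extra_bits_if_pruned(tree[child], tree[node]) <= gamma:
            del tree[child]
            children[node].remove(child)

# hypothetical toy machine: context -> symbol counts, plus child lists
tree = {"": {"a": 2, "b": 2}, "a": {"b": 2}}
children = {"": ["a"]}
prune(tree, children, "", gamma=2.0)
print(sorted(tree))  # the child node "a" has been pruned away
```

With γ = 0 only nodes whose removal costs nothing are pruned; a larger γ trades compression for a smaller tree.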
  • Figure 10 shows an example of such a context tree.
  • a message "VIISI" is compressed using these code words.
  • the longest match of the preceding symbols found in the model tree is to be searched for.
  • the context of the symbol S is the longest of the contexts "VII", "II", "I" and ∅ that exists in the model tree.
  • the match found is empty (∅).
  • the found matches are ∅, "I", "II", "IIS" and "SI", respectively.
  • the codeword for V in the context ∅ is 010.
  • the codeword for I in the context ∅ is 10.
  • the codeword for I in the context "I" is 110 and so on.
  • the compressed string is 010 10 110 0 0 1111.
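The longest-context lookup used in this example can be sketched as follows. The code table below is a hypothetical fragment containing only the codewords quoted above; the full model tree of Figure 10 is not reproduced here.

```python
def encode(code, message):
    """Encode each symbol with the codeword of its longest context
    that exists in the model tree and contains the symbol."""
    out = []
    for t, sym in enumerate(message):
        past = message[:t]
        for k in range(len(past), -1, -1):  # longest suffix first
            ctx = past[len(past) - k:]
            if ctx in code and sym in code[ctx]:
                out.append(code[ctx][sym])
                break
    return " ".join(out)

# hypothetical fragment of the code table ("" is the empty context ∅)
code = {"": {"V": "010", "I": "10"}, "I": {"I": "110"}}
print(encode(code, "VII"))  # -> 010 10 110
```

The first three codewords match those quoted above: V in ∅, I in ∅, and I in the context "I".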
  • Decompression is done similarly.
  • the model to be used is the same as in the compression, so the Huffman code table is already known.
  • For the first symbol it is known that its context is empty.
  • the first codeword (010) is read and the symbol V is the result.
  • For the next symbol the context is the longest of the contexts "V" and ∅ that exists in the model tree.
  • the next codeword is now read and the result is the symbol I.
  • the context of the next symbol is the longest of the contexts "VI", "I" and ∅ that exists in the model tree. Again, the next codeword is read and the result is again symbol I and so on.
  • the compressed string 010 10 110 0 0 1111 translates to the original message "VIISI".
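Decoding can be sketched with the same hypothetical toy table, assuming the per-context codes are prefix-free and that every symbol that can occur in a context has a codeword there (the escape mechanism is omitted from this sketch):

```python
def decode(code, bits, n_symbols):
    """Invert the encoding: pick the longest context existing in the
    model, then match the next prefix-free codeword in that context."""
    out = ""
    pos = 0
    while len(out) < n_symbols:
        for k in range(len(out), -1, -1):  # longest existing context
            ctx = out[len(out) - k:]
            if ctx in code:
                break
        inverse = {cw: s for s, cw in code[ctx].items()}
        end = pos + 1
        while bits[pos:end] not in inverse:  # extend until a codeword matches
            end += 1
        out += inverse[bits[pos:end]]
        pos = end
    return out

code = {"": {"V": "010", "I": "10"}, "I": {"I": "110"}}
print(decode(code, "01010110", 3))  # -> VII
```

Because the same model tree is used on both sides, no code table needs to be transmitted with the compressed data.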
  • the model tree is converted into a state machine.
  • Each node (or context) of the tree becomes a state.
  • Each symbol to be compressed causes a state transition to the state for the context of the next symbol (which is known by the symbols read this far).
  • the tree (which is now a state machine) need not be searched every time a symbol is read, and finding the context becomes an O(1) operation.
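The conversion can be sketched as a precomputed transition table: for each (state, symbol) pair, the next state is the longest suffix of the extended context that is still a node of the tree. The alphabet and the set of contexts below are illustrative assumptions.

```python
def next_context(contexts, ctx, sym):
    """Longest suffix of ctx + sym that is a node of the tree;
    the empty context (the root) always matches."""
    s = ctx + sym
    for k in range(len(s), -1, -1):
        if s[len(s) - k:] in contexts:
            return s[len(s) - k:]
    return ""

def build_fsm(contexts, alphabet):
    """Precompute every transition so that finding the next context
    during coding is a single table lookup, i.e. O(1) per symbol."""
    return {(c, a): next_context(contexts, c, a)
            for c in contexts for a in alphabet}

fsm = build_fsm({"", "I", "II"}, "VIS")
print(fsm[("I", "I")])   # -> II
print(fsm[("II", "V")])  # -> '' (falls back to the root)
```

The table costs memory proportional to (number of nodes) x (alphabet size), but removes all tree searching from the per-symbol coding loop.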
  • CORESSION compresses small files much better than the other five algorithms.
  • PKZIP 800 kilobytes per second.
  • ACE 100 kilobytes per second.
  • BZIP2 200 kilobytes per second.
  • PPMZ 20 kilobytes per second.
  • CORESSION 400 kilobytes per second with Arithmetic coding, or 900 kilobytes per second with Huffman codes
  • the data structures used are fixed. That is, the data structures do not need to be modified during compression or decompression, as opposed to any adaptive compression program.
  • the method of the present invention can be made at least as fast as any adaptive algorithm having at least as complex context and count structures and coding scheme as the method of the present invention.
  • COMPRESS a few hundred kilobytes.
  • PKZIP a few hundred kilobytes.
  • ACE up to 36 megabytes, typically several megabytes.
  • BZIP2 two to six times the size of the file to be compressed.
  • In Fig. 1b there is shown a simplified block diagram of an electronic device 18 in which the method of an advantageous embodiment of the present invention can be applied.
  • the electronic device comprises at least a first input/output block 19 for inputting the training data from e.g. a database 23.
  • the controlling unit 20 is provided for controlling the electronic device 18 and for performing the steps of the method of the present invention.
  • the controlling unit 20 may comprise one or more processors, such as a microprocessor and/or a digital signal processor.
  • the memory means 21 are provided for storing necessary program codes for the operation of the controlling unit, for storing temporary data, for storing the data structures, for storing the data to be compressed, etc.
  • the data to be compressed can be read e.g.
  • the compressed data can be saved into the memory means 21, into the database 23, and/or it can be transmitted to e.g. a data transmission channel 24 for transmission to a receiving device (not shown) where the compressed data can be decompressed.
  • the receiving device can also be similar to the electronic device presented in Fig. 1b.
  • the decompression can be performed by using data structures (Tree Machine, Finite State Machine) similar to those used in the compression.
  • the pruned tree machine can be transformed to a finite state machine, which is known as such.
  • the compression can normally be performed faster by the finite state machine than by the pruned tree machine.
  • the present invention can be applied in many applications. For example, in mobile telecommunication environment short messages can be compressed for reducing needed transmission capacity before sending them in a mobile network.
  • the present invention can also be applied in computers for compressing text files which then can be saved into storage medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
PCT/FI2002/000194 2001-03-16 2002-03-13 A method for compressing data WO2002075929A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20010525A FI110374B (fi) 2001-03-16 2001-03-16 Menetelmä tiedon pakkaamiseksi (A method for compressing data)
FI20010525 2001-03-16

Publications (1)

Publication Number Publication Date
WO2002075929A1 (en) 2002-09-26

Family

ID=8560753

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2002/000194 WO2002075929A1 (en) 2001-03-16 2002-03-13 A method for compressing data

Country Status (2)

Country Link
FI (1) FI110374B (fi)
WO (1) WO2002075929A1 (fi)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7028042B2 (en) * 2002-05-03 2006-04-11 Jorma Rissanen Lossless data compression system
CN113609344A (zh) * 2021-09-29 2021-11-05 北京泰迪熊移动科技有限公司 Construction method and device of a byte-stream state machine, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1995014350A1 (en) * 1993-11-15 1995-05-26 National Semiconductor Corporation Quadtree-structured walsh transform coding
US5534861A (en) * 1993-04-16 1996-07-09 International Business Machines Corporation Method and system for adaptively building a static Ziv-Lempel dictionary for database compression
WO1997036376A1 (en) * 1996-03-28 1997-10-02 Vxtreme, Inc. Table-based compression with embedded coding


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GINESTA XAVIER ET AL.: "Vector quantization of contextual information for lossless image compression", DATA COMPRESSION CONFERENCE, 1995. DCC'94, 29 March 1994 (1994-03-29) - 31 March 1994 (1994-03-31), pages 390 - 399 *
SAUPE DIETMAR ET AL.: "Optimal hierarchical partitions for fractal image compression", INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, 1998. ICIP, PROCEEDINGS, vol. 1, 4 October 1998 (1998-10-04) - 7 October 1998 (1998-10-07), pages 737 - 741 *


Also Published As

Publication number Publication date
FI20010525A (fi) 2002-09-17
FI110374B (fi) 2002-12-31
FI20010525A0 (fi) 2001-03-16

Similar Documents

Publication Publication Date Title
Fiala et al. Data compression with finite windows
CA2321233C (en) Block-wise adaptive statistical data compressor
EP0695040B1 (en) Data compressing method and data decompressing method
US5532694A (en) Data compression apparatus and method using matching string searching and Huffman encoding
US5270712A (en) Sort order preserving method for data storage compression
Nelson et al. The data compression book 2nd edition
US5841376A (en) Data compression and decompression scheme using a search tree in which each entry is stored with an infinite-length character string
US5229768A (en) Adaptive data compression system
US6100824A (en) System and method for data compression
JP3258552B2 (ja) Data compression apparatus and data decompression apparatus
US7764202B2 (en) Lossless data compression with separated index values and literal values in output stream
JPH09162748 (ja) Data encoding method, data decoding method, data compression apparatus, data decompression apparatus, and data compression/decompression system
JPH09246991 (ja) Data compression/decompression method, data compression apparatus, and data decompression apparatus
US6304676B1 (en) Apparatus and method for successively refined competitive compression with redundant decompression
US5594435A (en) Permutation-based data compression
JPS6356726B2 (fi)
US20030212695A1 (en) Lossless data compression system
EP1266455A1 (en) Method and apparatus for optimized lossless compression using a plurality of coders
Chen et al. Efficient Lossless Compression of Trees and Graphs
Rathore et al. A brief study of data compression algorithms
KR100494876B1 (ko) Method for compressing 2-byte character data
WO2002075929A1 (en) A method for compressing data
Ghuge Map and Trie based Compression Algorithm for Data Transmission
Yokoo An adaptive data compression method based on context sorting
Senthil et al. Text compression algorithms: A comparative study

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP