EP2076964A1 - Method and device for compression and decompression of digital data by electronic means using a context grammar - Google Patents

Method and device for compression and decompression of digital data by electronic means using a context grammar

Info

Publication number
EP2076964A1
EP2076964A1 EP07785674A EP07785674A EP2076964A1 EP 2076964 A1 EP2076964 A1 EP 2076964A1 EP 07785674 A EP07785674 A EP 07785674A EP 07785674 A EP07785674 A EP 07785674A EP 2076964 A1 EP2076964 A1 EP 2076964A1
Authority
EP
European Patent Office
Prior art keywords
grammar
context
digital data
compression
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP07785674A
Other languages
German (de)
English (en)
Inventor
Eric Hildebrandt
Martin Bokler
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deutsche Telekom AG
Original Assignee
Deutsche Telekom AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deutsche Telekom AG

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • the invention relates to a method and apparatus for compression and decompression of digital data by electronic means
  • The compression of digital data by electronic means, i.e., in an electronic system for information processing or data transmission, is used primarily to save storage space and transmission capacity.
  • Compression is important not only for efficiently utilizing existing transmission capacities, such as available bandwidth, but also for speeding up transmission. Even when storing large volumes of digital data in the gigabyte or terabyte range, such as databases, efficient compression is often required to reduce the storage space needed for the uncompressed digital data, thereby conserving technical resources.
  • Identical symbol sequences are not stored multiple times in a symbol sequence to be compressed; instead, a reference to the first occurrence of the symbol sequence is established. The reference indicates how many symbols one must go back in the sequence and how long the sequence to be repeated is.
  • The LZ78 algorithm creates a table of common symbol sequences. If such a symbol sequence occurs in a symbol sequence to be compressed, only the corresponding code from the table, which is shorter than the symbol sequence itself, has to be inserted.
  • The LZW algorithm is a table-based compression method that starts from a predetermined table of 256 entries, which is extended during compression according to the requirements of the symbol sequence to be compressed. If a symbol sequence already in the table appears in the symbol sequence to be compressed, the table index can be stored in its place.
  • The LZW algorithm is used, for example, for data compression in modems and in computer systems when storing GIF and TIFF files.
  • U.S. Patent No. 4,558,302 describes the LZW algorithm in detail.
  • the window-based methods are disadvantageous in that only text passages whose distance from one another is smaller than the window width can be linked to one another.
  • The object of the invention is to propose an improved method and a device for the compression and decompression of digital data by electronic means with which short data containing redundancy can be compressed and decompressed efficiently and quickly.
  • This object is achieved by a method for compression and decompression of digital data by electronic means using a context grammar with the features of claim 1, a computer program having the features of claim 11, a computer program product having the features of claim 12, and an apparatus having the features of claim 13.
  • the invention further relates to various uses of the method according to the invention as specified in claims 14, 16 and 18.
  • The object is achieved by a method for compression and decompression of digital data by electronic means using a context grammar, characterized by the steps of grammatically compressing first digital data by searching the first digital data to be compressed for multiply occurring sequences of non-decomposable terminal symbols (V_T), replacing the multiple occurrences found with further-decomposable nonterminal symbols (V_N), storing the digital data belonging to these nonterminal symbols in an associated context grammar, and performing a context compression in which second digital data are compressed using this context grammar.
  • The step of generating a grammar preferably takes place in such a way that, as a derivation, each symbol from the set of nonterminal symbols (V_N) is mapped onto symbols from the union of the set of nonterminal symbols (V_N) and the set of terminal symbols (V_T).
  • A step of generating a start symbol (S_0) whose derivation corresponds to a text to be compressed is executed.
  • the second digital data is similar to the first digital data.
  • these rules are stored in a tree structure, wherein the tree structure can be expandable with new rules obtained from the second digital data.
  • the tree structure is preferably traversed symbol by symbol in ascending order and is hereby searched for a grammatical rule corresponding to a longest prefix, for which a tree path exists starting from its root.
  • the generated grammar is additionally coded arithmetically or using a Huffman code.
  • A computer program for compressing and decompressing digital data electronically using a context grammar according to the above embodiment achieves the object when running on a data processing system such as a computer.
  • Such a computer program is preferably designed as a computer program product and comprises a machine-readable data carrier on which the computer program is stored in the form of electronically or optically readable control signals for a computer.
  • An apparatus for compression and decompression of digital data by electronic means using a context grammar with an input device, a processing device, a memory device and an output device for carrying out the aforementioned method is used for practicing the method according to the invention.
  • the inventive method for compression and decompression of digital data by electronic means using a context grammar is particularly efficient in compressing datasets of databases, in particular relational, object-oriented and XML-based databases.
  • a contextual grammar may be created for a table column, and contextual grammar may then be used to compress the column entries.
  • Another use of the inventive method for compression and decompression of digital data by electronic means using a context grammar is compressing a data transmission, in particular over a point-to-point connection.
  • This can increase the effective bandwidth of a data connection.
  • the relatively short data packets, as they often occur in data transmissions, are suitable for context compression.
  • Packet structures of the digital data to be transmitted can be compressed prior to data transmission using a context grammar present at both transmission endpoints.
  • An essential idea underlying embodiments of the invention is thus that, upon compression of first data, information is gained which can be used to efficiently compress second data similar to the first data. In other words, the information obtained from the first data can be used efficiently.
  • a context grammar is generated, which is then usable for compression of the second and further data.
  • information is obtained, which is then used to compress second data.
  • the grammar generated in the compression of the second data contains, in particular, a special rule, which is also referred to below as the start rule and whose expansion corresponds to the data to be compressed. While this starting rule is generally characteristic of the particular data set to be compressed, other rules which are "inserted" into the starting rule following the context grammar are more general in nature.
  • The information obtained from similar data is thus used as the basis for generating the grammar that is applied to the additional data currently to be compressed.
  • the symbols of the grammar can then be coded, for example by means of Huffman codes or arithmetically.
  • The amount of information to be used for the context grammar can be chosen flexibly in a simple manner, for example depending on the application, data type and amount of data.
  • the context information can be directly extracted from similar data by first compressing that data and creating it for it during that
  • The invention allows for more flexible coding possibilities, since the code of a grammar newly created for other data can be created and used independently of the context grammar code for the previously compressed data. This yields additional advantageous options for further optimization.
  • the greatest advantage of the method and the device according to the invention thus lies in particular in an efficient compression of small or short data sets which are not or substantially less efficiently compressible with the known compression methods. This results in significant advantages for the storage, transmission and processing of data for applications for such data sets.
  • Let V_T be the alphabet used in the data to be compressed, for example the set of 256 possible character values or symbols, such as those of the extended ASCII code, which can be coded with one byte.
  • The elements of V_T are called terminals and denote those symbols that cannot be decomposed further.
  • The grammar to be generated for compression is then described by a set V_N of nonterminal symbols, i.e., variables, a special start rule S_0 and derivation rules S_1 to S_n.
  • The derivation rules S_1 to S_n each contain a nonterminal symbol on the left side and at least two symbols from the union of V_T and V_N on the right side.
  • A short example should clarify this.
  • The text ABAB is to be compressed, where A and B are elements of V_T, i.e., terminals that cannot be broken down further.
  • the context-free grammar to be generated for the data to be compressed can also be obtained by means of so-called context compression.
  • In context compression, a plurality of (base) rules K_1 to K_n are either predefined or taken from a previously created grammar; they can then be referenced to generate a new, context-free grammar from the data currently being compressed.
  • The rules K_1 to K_n of the context grammar can thus be used both for creating new rules and in the start rule S_0.
  • Storage of the grammar uses a code in which frequent symbols are assigned shorter codewords than rare symbols.
  • a Huffman code can be used for this purpose.
  • A first possibility is that the codewords of the context grammar continue to be used. In this case, the entire context grammar is stored in coded form so that the codeword lengths used reflect the frequencies of occurrence of the corresponding expanded rules. Assuming that the data to be compressed is of the same type as, i.e., similar to, the data used to generate the context grammar, the frequencies in the data to be compressed behave similarly to the frequencies observed when generating the context grammar. It is therefore advantageous to continue to use the codewords from the context grammar for coding the context rules.
  • Two codes are used in parallel in connection with the aforementioned first possibility, i.e., in addition to the codewords that continue to be used, a separate code is also generated for the newly generated, record-specific rules. The compressed data is then stored using both the reused codewords from the context grammar and codewords from this newly generated code.
  • To identify the code to which the next codeword belongs, for example, i) one of the two codes contains otherwise unused code symbols that are used to mark one or more codewords of the other code, or ii) there is an otherwise unused code in both codes
  • The context grammar contains unused placeholder codewords that can be used for newly created rules.
  • a common code is generated both for the further used rules of the context grammar and for the newly created rules.
  • For each context rule used, the assignment to a new codeword must be possible. This can be done, for example, by specifying, when defining the corresponding new codewords, the codeword associated with the context grammar rule.
  • the establishment of the association with the new code word is not limited to the above modes, but may be appropriately selected according to the characteristics of the data to be compressed in order to achieve the best possible compression.
  • first digital data is first of all grammatically compressed.
  • Let V_T be the set of symbols used in the first digital data.
  • This data, for example a text, is searched for multiply occurring sequences of terminal symbols from V_T, i.e., symbols or characters that cannot be decomposed further.
  • The sequences of symbols from V_T that are found are then replaced by a nonterminal symbol, i.e., a symbol that can be further decomposed according to rules, and the partial data sequence belonging to this symbol, for example a subtext, is stored in a grammar containing rules. This results in a set of nonterminal symbols V_N.
  • For each symbol A from the set V_N, the resulting grammar indicates the symbols from the union of V_N and V_T onto which A is mapped. This is also referred to as the derivation of (symbol) A.
  • In context compression, similar second digital data is compressed with the predetermined grammar generated from the first digital data. If the grammar generated from the first digital data has already been stored elsewhere, the amount of data to be stored for the compressed second digital data is advantageously reduced.
  • the second digital data can be compressed immediately.
  • The generation of the grammar can be done in various ways, for example according to the methods Sequential, Sequitur, or Re-Pair.
  • the Sequential example below describes how a grammar can be used efficiently as context grammar and read in such a way that it can be used with little computational effort.
  • expansions of these rules are preferably stored in a tree.
  • a node of such a tree corresponds in this case to a data string or a string, and branches branching off from such a node correspond to the continuations according to the grammar rules of a tree
  • Such a tree can be extended with new grammar rules by inserting, starting from the root of the tree, a data string corresponding to an expanded grammar rule into the tree. Once all rules of the grammar have been inserted into the tree, the tree can be used for context compression.
  • underlying text is traversed from front to back, with the aim of finding the grammatical rule that corresponds to the longest possible prefix of the text. In other words, this searches for the longest prefix of the text for which there is a path within the tree from its root. This is efficiently possible because at each node there is at most one corresponding branch for each letter.
  • the nodes of such a path may fully comply with grammar rules, or only correspond to a part of a rule.
  • the longest prefix corresponds to the last node of a path that corresponds to a rule.
  • this rule can be applied and the underlying algorithm continues after the data string that matches the rule. If no rule is found, the first terminal symbol of the text to be compressed is used and the algorithm applied to the text following it.
  • Another class of compression methods compresses the column entries individually. However, for the short database entries considered here, these achieve at most low compression.
  • the compression methods used in known databases such as Oracle or IBM DB2 are fundamentally different: The compression method used in Oracle works locally on memory pages. So some lines of the table are always compressed at once. With the method according to the invention, however, the entries of an entire column are compressed.
  • the compression used in IBM DB2 uses a global dictionary with a codeword length of 12 bits.
  • Advantages of context compression according to the method of the invention are the variable codeword length and the possibility of compressing partial strings as well. Admittedly, individual database entries can also be compressed in Oracle and other databases, for example with LZ77. However, this is worthwhile only for longer entries that are redundant in themselves. In the scope of context compression (columns with short entries, where the entries of a column contain redundant parts), this type of compression cannot be used profitably.
  • Another application of the context compression described above is the compression of point-to-point connections in data transfers to increase the effective bandwidth of such connections. Relatively short data packets, such as frequently occur during data transmission, are particularly suitable for context compression. In contrast to the known standard methods, which can only use the relatively low redundancy within a packet, context compression allows highly efficient compression of typical packet structures.
  • The proposed context compression may also be designed to be adaptive, such that rules within the context grammars are extended or renewed synchronously at sender and receiver.
  • Another application of context compression is the compression of small files that can be compressed only slightly on their own; for example, when storing many small files of the same type, using a context grammar is advantageous. Examples include XML-formatted order forms and other records of similar structure.
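
The short ABAB example mentioned in the description (terminals A and B, a start rule S_0 and derivation rules with at least two symbols on the right side) can be illustrated with a small sketch. The grammar shown here is one possible grammar for ABAB chosen for illustration, not one quoted from the patent; the dictionary representation and the names `Grammar`, `expand` and `abab_grammar` are assumptions.

```python
from typing import Dict, List

# Hypothetical representation: keys are nonterminals (V_N, including the
# start rule "S0"); any symbol that is not a key is treated as a terminal (V_T).
Grammar = Dict[str, List[str]]


def expand(symbol: str, grammar: Grammar) -> str:
    """Recursively expand a symbol into the terminal string it derives."""
    if symbol not in grammar:
        return symbol  # terminal: cannot be decomposed further
    return "".join(expand(s, grammar) for s in grammar[symbol])


# One possible grammar for the text ABAB (illustrative assumption):
#   S0 -> S1 S1
#   S1 -> A B
abab_grammar: Grammar = {
    "S0": ["S1", "S1"],
    "S1": ["A", "B"],
}

print(expand("S0", abab_grammar))  # prints ABAB
```

Expanding the start rule reproduces the original text, which is the property the description requires of S_0.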
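
The tree-based search described above (expanded rules stored in a tree, longest prefix matched from the root, fallback to the first terminal symbol when no rule matches) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `TrieNode`, `build_trie`, `context_compress` and `context_decompress`, the sample rules K1 to K3, and the convention that rule names are longer than one character (so they cannot collide with single-character terminals) are all hypothetical.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.rule: Optional[str] = None  # rule whose expansion ends at this node


def build_trie(expanded_rules: Dict[str, str]) -> TrieNode:
    """Insert the expansion of every context rule into the tree, from the root."""
    root = TrieNode()
    for name, expansion in expanded_rules.items():
        node = root
        for ch in expansion:
            node = node.children.setdefault(ch, TrieNode())
        node.rule = name
    return root


def context_compress(text: str, root: TrieNode) -> List[str]:
    """Greedy pass: at each position apply the rule matching the longest prefix."""
    out: List[str] = []
    i = 0
    while i < len(text):
        node, j = root, i
        best_rule, best_len = None, 0
        # Follow the unique branch for each symbol as long as one exists.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.rule is not None:  # last node on the path that is a full rule
                best_rule, best_len = node.rule, j - i
        if best_rule is None:
            out.append(text[i])  # no rule found: emit the terminal itself
            i += 1
        else:
            out.append(best_rule)
            i += best_len
    return out


def context_decompress(symbols: List[str], expanded_rules: Dict[str, str]) -> str:
    """Replace each rule name by its expansion; terminals pass through."""
    return "".join(expanded_rules.get(s, s) for s in symbols)


# Hypothetical context rules, given directly in expanded form.
context_rules = {"K1": "Street", "K2": "Str", "K3": "Berlin"}
tokens = context_compress("Strand Street", build_trie(context_rules))
print(tokens)
print(context_decompress(tokens, context_rules))
```

Because each node has at most one branch per symbol, the search for the longest matching rule is a single walk down the tree per position.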
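
The description states that the grammar is stored with a code in which frequent symbols receive shorter codewords than rare ones, for example a Huffman code. A minimal sketch of building such a code over a sequence of grammar symbols follows. The function name `huffman_code` and the sample symbol sequence are assumptions made for illustration; the patent does not specify how its code is constructed in detail.

```python
import heapq
from collections import Counter
from typing import Dict, List


def huffman_code(symbols: List[str]) -> Dict[str, str]:
    """Build a prefix-free binary code; frequent symbols get shorter codewords."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}  # a single symbol still needs one bit
    # Heap entries: (weight, unique tie-breaker, {symbol: partial codeword}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codewords with 0 / 1.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


# Hypothetical symbol stream: context rules K1, K2 and terminals A, B.
sample = ["K1"] * 5 + ["K2"] * 2 + ["A", "B"]
code = huffman_code(sample)
for symbol in sorted(code, key=lambda s: len(code[s])):
    print(symbol, code[symbol])
```

In this sample the most frequent rule, K1, receives the shortest codeword, which mirrors the stated motivation for reusing context-grammar codewords when the new data has similar symbol frequencies.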

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a method for compression and decompression of digital data by electronic means using a context grammar, characterized by the following steps: - grammatically compressing first digital data by searching the first digital data to be compressed for multiply occurring sequences of terminal symbols (V_T) that cannot be decomposed further; - replacing the multiply occurring sequences found of terminal symbols (V_T) that cannot be decomposed further with nonterminal symbols (V_N) that can be decomposed further; - storing digital data belonging to these nonterminal symbols (V_N) in an associated context grammar; and - performing a context compression in which second digital data that were generated from the first digital data are compressed using this context grammar.
EP07785674A 2006-10-07 2007-07-24 Procédé et dispositif de compression et de décompression de données numériques par voie électronique en utilisant une grammaire contextuelle Ceased EP2076964A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102006047465A DE102006047465A1 (de) 2006-10-07 2006-10-07 Verfahren und Vorrichtung zur Kompression und Dekompression digitaler Daten auf elektronischem Wege unter Verwendung einer Kontextgrammatik
PCT/DE2007/001311 WO2008040267A1 (fr) 2006-10-07 2007-07-24 Procédé et dispositif de compression et de décompression de données numériques par voie électronique en utilisant une grammaire contextuelle

Publications (1)

Publication Number Publication Date
EP2076964A1 true EP2076964A1 (fr) 2009-07-08

Family

ID=38740471

Family Applications (1)

Application Number Title Priority Date Filing Date
EP07785674A Ceased EP2076964A1 (fr) 2006-10-07 2007-07-24 Procédé et dispositif de compression et de décompression de données numériques par voie électronique en utilisant une grammaire contextuelle

Country Status (4)

Country Link
US (1) US20100312755A1 (fr)
EP (1) EP2076964A1 (fr)
DE (1) DE102006047465A1 (fr)
WO (1) WO2008040267A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8495034B2 (en) * 2010-12-30 2013-07-23 Teradata Us, Inc. Numeric, decimal and date field compression
US11995058B2 (en) * 2022-07-05 2024-05-28 Sap Se Compression service using FPGA compression

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4558302A (en) * 1983-06-20 1985-12-10 Sperry Corporation High speed data compression and decompression apparatus and method
JP3273119B2 (ja) * 1995-09-29 2002-04-08 京セラ株式会社 データ圧縮・伸長装置
US6006232A (en) * 1997-10-21 1999-12-21 At&T Corp. System and method for multirecord compression in a relational database
US6489902B2 (en) * 1997-12-02 2002-12-03 Hughes Electronics Corporation Data compression for use with a communications channel
US6327699B1 (en) * 1999-04-30 2001-12-04 Microsoft Corporation Whole program path profiling
US6762699B1 (en) * 1999-12-17 2004-07-13 The Directv Group, Inc. Method for lossless data compression using greedy sequential grammar transform and sequential encoding
US6400289B1 (en) * 2000-03-01 2002-06-04 Hughes Electronics Corporation System and method for performing lossless data compression and decompression
JP4693292B2 (ja) * 2000-09-11 2011-06-01 株式会社東芝 強磁性トンネル接合素子およびその製造方法
US8868544B2 (en) * 2002-04-26 2014-10-21 Oracle International Corporation Using relational structures to create and support a cube within a relational database system
US6801141B2 (en) * 2002-07-12 2004-10-05 Slipstream Data, Inc. Method for lossless data compression using greedy sequential context-dependent grammar transform
US20050273274A1 (en) * 2004-06-02 2005-12-08 Evans Scott C Method for identifying sub-sequences of interest in a sequence
US20060117307A1 (en) * 2004-11-24 2006-06-01 Ramot At Tel-Aviv University Ltd. XML parser
US7840774B2 (en) * 2005-09-09 2010-11-23 International Business Machines Corporation Compressibility checking avoidance
US7447865B2 (en) * 2005-09-13 2008-11-04 Yahoo ! Inc. System and method for compression in a distributed column chunk data store
US7403951B2 (en) * 2005-10-07 2008-07-22 Nokia Corporation System and method for measuring SVG document similarity
US7464247B2 (en) * 2005-12-19 2008-12-09 Yahoo! Inc. System and method for updating data in a distributed column chunk data store
US7921087B2 (en) * 2005-12-19 2011-04-05 Yahoo! Inc. Method for query processing of column chunks in a distributed column chunk data store

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2008040267A1 *

Also Published As

Publication number Publication date
DE102006047465A1 (de) 2008-04-10
US20100312755A1 (en) 2010-12-09
WO2008040267A1 (fr) 2008-04-10

Similar Documents

Publication Publication Date Title
DE10301362B4 (de) Blockdatenkompressionssystem, bestehend aus einer Kompressionseinrichtung und einer Dekompressionseinrichtung, und Verfahren zur schnellen Blockdatenkompression mit Multi-Byte-Suche
DE60000912T2 (de) Verfahren und Vorrichtung zur Datenkomprimierung von Netzwerkdatenpaketen unter Verwendung von paketweisen Hash Tabellen
DE60001210T2 (de) Verfahren und Vorrichtung zur Datenkomprimierung von Netzwerkdatenpaketen
DE69704362T2 (de) Datenkompressions-/dekompressionssystem anhand sofortiger zeichenfolgensucheverschachtelter wörterbuchaktualisierung
DE68925798T2 (de) Datenverdichtung
DE69905343T2 (de) Blockweiser adaptiver statistischer datenkompressor
DE10196890B4 (de) Verfahren zum Ausführen einer Huffman-Decodierung
DE19622045C2 (de) Datenkomprimierungs- und Datendekomprimierungsschema unter Verwendung eines Suchbaums, bei dem jeder Eintrag mit einer Zeichenkette unendlicher Länge gespeichert ist
DE69528152T2 (de) Datenkompressionsvorrichtung, Datenexpansionsvorrichtung und System zur Datenkompression und Expansion
DE4340591C2 (de) Datenkompressionsverfahren unter Verwendung kleiner Wörterbücher zur Anwendung auf Netzwerkpakete
DE69026924T2 (de) Datenkomprimierungsverfahren
DE69318446T2 (de) Verfahren und Vorrichtung zur Datenkompression und -dekompression für eine Übertragungsanordnung
DE69027606T2 (de) Vorrichtung zur datenkompression
DE69522497T2 (de) System und Verfahren zur Datenkompression
DE69833094T2 (de) Verfahren und Vorrichtung zur adaptiven Datenkompression mit höherem Kompressionsgrad
DE60000380T2 (de) Verfahren und Vorrichtung zur Datenkompression
EP2296282B1 (fr) Procédé et agencement d'encodage et de décodage arithmétiques utilisant plusieurs tables de consultation
DE69834695T2 (de) Verfahren und Vorrichtung zur Datenkompression
DE2801988A1 (de) Arithmetische codierung von symbolfolgen
DE3485824T2 (de) Verfahren zur datenkompression.
DE69025160T2 (de) Verfahren zur Dekodierung komprimierter Daten
DE69021854T2 (de) Verfahren zur Dekomprimierung von komprimierten Daten.
EP2076964A1 (fr) Procédé et dispositif de compression et de décompression de données numériques par voie électronique en utilisant une grammaire contextuelle
DE68927939T2 (de) Monadische Kodierung vom Start-Schritt-Stop-Typ für die Datenkomprimierung
DE19653133C2 (de) System und Verfahren zur pre-entropischen Codierung

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20090507

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR

RIN1 Information on inventor provided before grant (corrected)

Inventor name: HILDEBRANDT, ERIC

Inventor name: BOKLER, MARTIN

17Q First examination report despatched

Effective date: 20090924

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20120707