EP2076964A1 - Method and device for the compression and decompression of digital data by electronic means using a context grammar
Info
- Publication number
- EP2076964A1 (application EP07785674A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- grammar
- context
- digital data
- compression
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
Definitions
- the invention relates to a method and apparatus for compression and decompression of digital data by electronic means
- the compression of digital data by electronic means, i.e. in an electronic system for information processing or data transmission, is used primarily to save storage space and transmission capacity.
- compression is important not only for efficiently utilizing existing transmission capacities, such as available bandwidth, but also for speeding up transmission. Even when storing large volumes of digital data in the gigabyte or terabyte range, such as databases, efficient compression is often required to reduce the storage space needed compared with the uncompressed digital data, so that technical resources can be conserved.
- identical symbol sequences are not stored multiple times in a symbol sequence to be compressed; instead, a reference to a first occurrence of a symbol sequence is established. The reference indicates how many symbols one must go back in the sequence and how long the sequence to be repeated is.
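The back-reference idea described above can be illustrated with a small decoder. This is a minimal sketch and not taken from the patent: the token format (a literal symbol, or a `(distance, length)` tuple) and the function name `decode_back_references` are assumptions chosen for illustration.

```python
def decode_back_references(tokens):
    """Expand literals and (distance, length) back-references into a string.

    Token format is an assumption for this sketch: a one-character string is a
    literal, a tuple (distance, length) means "go back `distance` symbols and
    repeat `length` symbols from there".
    """
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            start = len(out) - distance
            # Copy symbol by symbol so that overlapping references also work.
            for i in range(length):
                out.append(out[start + i])
        else:
            out.append(token)
    return "".join(out)


print(decode_back_references(["A", "B", (2, 2), "C", (5, 3)]))  # ABABCABA
```

The reference `(2, 2)` goes back two symbols and repeats two symbols, so `AB` is stored only once.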
- the LZ78 algorithm creates a table of common symbol sequences. If such a symbol sequence occurs in a symbol sequence to be compressed, only the corresponding code from the table, which is shorter than the symbol sequence itself, has to be inserted.
- the LZW algorithm is a table-based compression method that starts from a predetermined table of 256 entries, which is extended in the course of the compression process according to the requirements of the symbol sequence to be compressed. If a symbol sequence already present in the table appears in the symbol sequence to be compressed, the table index can be stored in its place.
- the LZW algorithm is used, for example, for data compression in modems and in computer systems when storing GIF and TIFF files.
- U.S. Patent No. 4,558,302 describes the LZW algorithm in detail.
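The table-based scheme described for LZW can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the cited patent: codes are kept as plain integers (no variable-width bit packing), the table is never reset or capped, and the function names are hypothetical.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Return a list of table indices for `data` (simplified LZW sketch)."""
    # Predetermined table of 256 single-byte entries.
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate            # keep extending the match
        else:
            codes.append(table[current])   # emit index of the longest match
            table[candidate] = len(table)  # extend the table
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Invert lzw_compress by rebuilding the same table while decoding."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # The code refers to the entry that is being created right now.
            entry = previous + previous[:1]
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


print(lzw_compress(b"ABABABA"))  # [65, 66, 256, 258]
```

Indices 256 and 258 refer to the table entries `AB` and `ABA` that were added while the input was being processed.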
- the window-based methods are disadvantageous in that only text passages whose distance from one another is smaller than the window width can be linked to one another.
- the object of the invention is to propose an improved method and device for the compression and decompression of digital data by electronic means with which short data containing redundancy can be compressed and decompressed efficiently and quickly.
- This object is achieved by a method for compression and decompression of digital data by electronic means using a context grammar with the features of claim 1, a computer program having the features of claim 11, a computer program product having the features of claim 12, and an apparatus having the features of claim 13.
- the invention further relates to various uses of the method according to the invention as specified in claims 14, 16 and 18.
- the object is achieved by a method for compression and decompression of digital data by electronic means using a context grammar, characterized by the steps of grammatically compressing first digital data by searching the first digital data to be compressed for sequences of non-decomposable terminal symbols (V_T) that occur multiple times, replacing found multiple occurrences
- the step of generating a grammar preferably takes place in such a way that, as a derivation, each symbol from the set of nonterminal symbols (V_N) is mapped onto symbols from the union of the set of nonterminal symbols (V_N) and the set of terminal symbols (V_T).
- a step of generating a start symbol (S_0) whose derivation corresponds to a text to be compressed is executed.
- the second digital data is similar to the first digital data.
- these rules are stored in a tree structure, wherein the tree structure can be expandable with new rules obtained from the second digital data.
- the tree structure is preferably traversed symbol by symbol in ascending order and is hereby searched for a grammatical rule corresponding to a longest prefix, for which a tree path exists starting from its root.
- the generated grammar is additionally coded arithmetically or using a Huffman code.
- a computer program for compressing and decompressing digital data electronically using a context grammar according to the above embodiment achieves the stated object when run on a data processing system such as a computer.
- Such a computer program is preferably designed as a computer program product and comprises a machine-readable data carrier on which the computer program is stored in the form of electronically or optically readable control signals for a computer.
- An apparatus for compression and decompression of digital data by electronic means using a context grammar with an input device, a processing device, a memory device and an output device for carrying out the aforementioned method is used for practicing the method according to the invention.
- the inventive method for compression and decompression of digital data by electronic means using a context grammar is particularly efficient in compressing datasets of databases, in particular relational, object-oriented and XML-based databases.
- a context grammar may be created for a table column, and this context grammar may then be used to compress the column entries.
- a further use of the inventive method for compression and decompression of digital data by electronic means using a context grammar is the compression of a data transmission, in particular over a point-to-point connection.
- This can increase the effective bandwidth of a data connection.
- the relatively short data packets, as they often occur in data transmissions, are suitable for context compression.
- packet structures of digital data to be transmitted can be compressed prior to data transmission using a context grammar present at both endpoints of the transmission.
- An essential idea underlying embodiments of the invention is thus that, upon compression of first data, information is gained which can be used to efficiently compress second data similar to the first data. In other words, the information obtained from the first data can be used efficiently.
- a context grammar is generated, which is then usable for compression of the second and further data.
- information is obtained, which is then used to compress second data.
- the grammar generated in the compression of the second data contains, in particular, a special rule, which is also referred to below as the start rule and whose expansion corresponds to the data to be compressed. While this starting rule is generally characteristic of the particular data set to be compressed, other rules which are "inserted" into the starting rule following the context grammar are more general in nature.
- the information obtained from similar data is thus used as the basis for generating the grammar that is applied to the additional data currently to be compressed.
- the symbols of the grammar can then be coded, for example by means of Huffman codes or arithmetically.
- the amount of information to be used for the context grammar can be chosen flexibly in a simple manner, for example depending on the application, the data type and the amount of data.
- the context information can be directly extracted from similar data by first compressing that data and creating it for it during that
- the invention allows for more flexible coding possibilities, since the code of a grammar newly created for other data can be created and used independently of the code of the context grammar for the previously compressed data. This opens up additional advantageous options for further optimization.
- the greatest advantage of the method and the device according to the invention thus lies in particular in an efficient compression of small or short data sets which are not or substantially less efficiently compressible with the known compression methods. This results in significant advantages for the storage, transmission and processing of data for applications for such data sets.
- let V_T be the alphabet used in the data to be compressed, for example the set of 256 possible character values or symbols, such as those of the extended ASCII code, which can be coded with one byte.
- the elements of V_T are called terminals and denote those symbols that cannot be decomposed further.
- the grammar to be generated for compression is then described by a set V_N of non-terminal symbols, i.e. variables, a special start rule S_0 and derivation rules S_1 to S_n.
- the derivation rules S_1 to S_n each contain a nonterminal symbol on the left side and at least 2 symbols from the union of V_T and V_N on the right side.
- a short example should clarify this.
- the text ABAB is to be compressed, where A and B are elements of V_T, i.e. terminals that cannot be broken down further.
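The grammar resulting from this example is not reproduced in this text. One grammar that is consistent with the definitions above would be S_0 → S_1 S_1 and S_1 → A B. The following sketch expands that assumed grammar back into the original text; the dictionary layout and the function name `expand` are illustrative choices, not part of the patent.

```python
# Assumed example grammar for the text "ABAB":
#   S0 -> S1 S1   (start rule)
#   S1 -> A B     (derivation rule with two terminals on the right side)
GRAMMAR = {
    "S0": ["S1", "S1"],
    "S1": ["A", "B"],
}


def expand(symbol, grammar):
    """Recursively expand a symbol; symbols without a rule are terminals."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(s, grammar) for s in grammar[symbol])


print(expand("S0", GRAMMAR))  # ABAB
```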
- the context-free grammar to be generated for the data to be compressed can also be obtained by means of so-called context compression.
- in context compression, a plurality of (base) rules K_1 to K_n are either predefined or taken from a previously created grammar; these rules can then be referenced when generating a new, context-free grammar from the data currently being compressed.
- the rules of the context grammar K_1 to K_n can thus be used both for creating new rules and in the start rule S_0.
- Storage of the grammar uses a code in which frequent symbols are assigned shorter codewords than rare symbols.
- a Huffman code can be used for this purpose.
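As an illustration of assigning shorter codewords to frequent symbols, the following sketch builds a Huffman code over a list of grammar symbols. The symbol names (`K1`, `K2`, `A`, `B`) and the function name `huffman_code` are assumptions for this example; the patent text does not prescribe a particular construction.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each symbol to a prefix-free bit string."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie-breaker, partial code table).
    heap = [(count, index, {sym: ""})
            for index, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_index = len(heap)
    while len(heap) > 1:
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_index, merged))
        next_index += 1
    return heap[0][2]


# Hypothetical symbol stream: context rule K1 is used most often.
stream = ["K1"] * 5 + ["K2"] * 2 + ["A", "B"]
code = huffman_code(stream)
for symbol in ("K1", "K2", "A", "B"):
    print(symbol, len(code[symbol]))
```

In this example the most frequent symbol `K1` receives a one-bit codeword, `K2` two bits, and the rare terminals three bits each.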
- a first possibility is that the codewords of the context grammar continue to be used. In this case, the entire context grammar is stored in coded form so that the codeword lengths used reflect the frequencies of occurrence of the corresponding expanded rules. Assuming that the data to be compressed is of the same type as, i.e. similar to, the data used for generating the context grammar, the frequencies in the data to be compressed behave similarly to the frequencies observed when the context grammar was generated. It is therefore advantageous to continue to use the codewords from the context grammar for coding the context rules.
- two codes are used in parallel in connection with the aforementioned first possibility, i.e., in addition to the codewords that continue to be used, a separate code is generated for the newly generated, record-specific rules. The compressed data is then stored using both the retained codewords from the context grammar and codewords from this newly generated code.
- to indicate to which code the next codeword belongs, for example, i) in one of the two codes there are otherwise unused code symbols which are used to identify one or more code words of the other code, or ii) there is an otherwise unused code in both codes
- the context grammar contains unused placeholder codewords that can be used for newly created rules.
- a common code is generated both for the further used rules of the context grammar and for the newly created rules.
- the assignment to a new code word must be possible for a context rule that is used. This can be done, for example, by specifying the definition of the corresponding new code word by means of the codeword associated with the context grammar rule.
- the establishment of the association with the new code word is not limited to the above modes, but may be appropriately selected according to the characteristics of the data to be compressed in order to achieve the best possible compression.
- first digital data is first of all grammatically compressed.
- V_T be the set of symbols used in the first digital data.
- this data, for example a text, is searched for sequences of terminal symbols V_T, i.e. symbols or characters that cannot be separated further, that occur multiple times.
- Sequences of symbols from V_T that are found are then replaced by a nonterminal symbol, i.e. a symbol that can be further decomposed according to rules, and the partial data sequence belonging to this symbol, for example a subtext, is stored in a grammar containing rules. This results in a set of nonterminal symbols V_N.
- the resulting grammar indicates, for each symbol A from the set V_N, the symbols from the union of V_N and V_T onto which A is mapped. This is also referred to as the derivation of (symbol) A.
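The replacement of repeated sequences by nonterminal symbols can be sketched with a greatly simplified pair-replacement loop. This is an illustrative stand-in in the spirit of pair-based grammar compressors, not the procedure claimed in the patent: it only ever replaces the most frequent adjacent pair, and the names `build_grammar` and `expand` are assumptions.

```python
from collections import Counter


def build_grammar(text):
    """Replace repeated adjacent pairs by nonterminals S1, S2, ...

    Returns a dict of rules; the start rule "S0" derives the whole text.
    """
    sequence = list(text)
    rules = {}
    while True:
        pairs = Counter(zip(sequence, sequence[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair occurs more than once
        name = f"S{len(rules) + 1}"
        rules[name] = list(pair)
        replaced, i = [], 0
        while i < len(sequence):
            if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) == pair:
                replaced.append(name)
                i += 2
            else:
                replaced.append(sequence[i])
                i += 1
        sequence = replaced
    rules["S0"] = sequence
    return rules


def expand(symbol, rules):
    """Derive the terminal string for a symbol (terminals have no rule)."""
    if symbol not in rules:
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


print(build_grammar("ABAB"))  # {'S1': ['A', 'B'], 'S0': ['S1', 'S1']}
```

Expanding `S0` with the generated rules reproduces the input, which is the property the start rule is required to have.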
- In context compression, similar second digital data is compressed with the predetermined grammar generated from the first digital data. If the grammar generated from the first digital data has already been stored elsewhere, the amount of data to be stored for the compressed second digital data is advantageously reduced.
- the second digital data can be compressed immediately.
- the generation of the grammar can be done in various ways, for example according to the methods Sequential, Sequitur, or Re-Pair.
- the Sequential example below describes how a grammar can be used efficiently as context grammar and read in such a way that it can be used with little computational effort.
- expansions of these rules are preferably stored in a tree.
- a node of such a tree corresponds in this case to a data string or a string, and branches branching off from such a node correspond to the continuations according to the grammar rules of a tree
- Such a tree can be extended with new grammar rules by inserting, starting from the root of the tree, the data string corresponding to the expanded grammar rule into the tree. Once all rules of the grammar have been inserted into the tree, the tree can be used for context compression.
- underlying text is traversed from front to back, with the aim of finding the grammatical rule that corresponds to the longest possible prefix of the text. In other words, this searches for the longest prefix of the text for which there is a path within the tree from its root. This is efficiently possible because at each node there is at most one corresponding branch for each letter.
- the nodes of such a path may correspond to complete grammar rules, or only to part of a rule.
- the longest prefix corresponds to the last node of a path that corresponds to a rule.
- this rule can be applied and the underlying algorithm continues after the data string that matches the rule. If no rule is found, the first terminal symbol of the text to be compressed is used and the algorithm applied to the text following it.
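The tree-based search for the longest matching rule described above can be sketched as follows. This is a minimal illustration under assumptions: the rule names (`K1` to `K3`), their expansions, and the class and function names are hypothetical, and rule references and terminals are simply distinguished by the fact that terminals are single characters.

```python
class Node:
    """Tree node; `rule` is set if the path from the root spells a full rule."""

    def __init__(self):
        self.children = {}
        self.rule = None


def build_tree(expansions):
    """Insert the expanded string of every rule into the tree from the root."""
    root = Node()
    for rule, text in expansions.items():
        node = root
        for symbol in text:
            node = node.children.setdefault(symbol, Node())
        node.rule = rule
    return root


def compress(text, root):
    """Greedily replace the longest prefix that matches a rule."""
    out, pos = [], 0
    while pos < len(text):
        node, best_rule, best_end = root, None, pos
        i = pos
        while i < len(text) and text[i] in node.children:
            node = node.children[text[i]]
            i += 1
            if node.rule is not None:       # last node on the path that is a rule
                best_rule, best_end = node.rule, i
        if best_rule is None:
            out.append(text[pos])           # no rule found: emit the terminal
            pos += 1
        else:
            out.append(best_rule)
            pos = best_end
    return out


def decompress(symbols, expansions):
    """Replace rule references by their expansions; terminals pass through."""
    return "".join(expansions.get(s, s) for s in symbols)


# Hypothetical context grammar, e.g. obtained from earlier column entries.
EXPANSIONS = {"K1": "Street", "K2": "Street 1", "K3": "Berlin"}
ROOT = build_tree(EXPANSIONS)

print(compress("Main Street 12", ROOT))  # ['M', 'a', 'i', 'n', ' ', 'K2', '2']
print(compress("Street 5", ROOT))        # ['K1', ' ', '5']
```

In the second call the search walks past `Street` into the path for `Street 1`, fails on `5`, and falls back to `K1`, the last node on the path that corresponds to a complete rule.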
- Another class of compression methods compresses the column entries individually. However, for the short database entries considered here, these methods achieve at most a low compression.
- the compression methods used in known databases such as Oracle or IBM DB2 are fundamentally different: the compression method used in Oracle works locally on memory pages, so that a few rows of the table are always compressed together. With the method according to the invention, by contrast, the entries of an entire column are compressed.
- the compression used in IBM DB2 uses a global dictionary with a codeword length of 12 bits.
- Advantages of context compression according to the method of the invention are the variable code word length and the possibility of also compressing partial strings. Admittedly, individual database entries can also be compressed in Oracle and other databases, for example with LZ77. However, this is worthwhile only for longer entries that are redundant. In the area of application of context compression (columns with short entries, where the entries of a column contain redundant parts), this type of compression cannot be used profitably.
- Another application of the context compression described above is the compression of point-to-point connections in data transfers in order to increase the effective bandwidth of such connections. Relatively short data packets, such as frequently occur during data transmission, are particularly suitable for context compression. In contrast to the known standard methods, which can only exploit the relatively low redundancy within a packet, context compression allows highly efficient compression of typical packet structures.
- the proposed context compression may also be designed to be adaptive, such that rules within the context grammars can be kept synchronized or renewed at sender and receiver.
- A further application of context compression is the compression of small files that can be compressed only slightly when treated individually; for example, when storing many small files of the same type, a context grammar can be applied advantageously. Examples include XML-formatted order forms and other records of similar structure.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE102006047465A DE102006047465A1 (de) | 2006-10-07 | 2006-10-07 | Method and device for the compression and decompression of digital data by electronic means using a context grammar |
PCT/DE2007/001311 WO2008040267A1 (fr) | 2006-10-07 | 2007-07-24 | Method and device for the compression and decompression of digital data by electronic means using a context grammar |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2076964A1 (fr) | 2009-07-08 |
Family
ID=38740471
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP07785674A Ceased EP2076964A1 (fr) | Method and device for the compression and decompression of digital data by electronic means using a context grammar | 2006-10-07 | 2007-07-24 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20100312755A1 (fr) |
EP (1) | EP2076964A1 (fr) |
DE (1) | DE102006047465A1 (fr) |
WO (1) | WO2008040267A1 (fr) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8495034B2 (en) * | 2010-12-30 | 2013-07-23 | Teradata Us, Inc. | Numeric, decimal and date field compression |
US11995058B2 (en) * | 2022-07-05 | 2024-05-28 | Sap Se | Compression service using FPGA compression |
Family Cites Families (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4558302A (en) * | 1983-06-20 | 1985-12-10 | Sperry Corporation | High speed data compression and decompression apparatus and method |
JP3273119B2 (ja) * | 1995-09-29 | 2002-04-08 | 京セラ株式会社 | データ圧縮・伸長装置 |
US6006232A (en) * | 1997-10-21 | 1999-12-21 | At&T Corp. | System and method for multirecord compression in a relational database |
US6489902B2 (en) * | 1997-12-02 | 2002-12-03 | Hughes Electronics Corporation | Data compression for use with a communications channel |
US6327699B1 (en) * | 1999-04-30 | 2001-12-04 | Microsoft Corporation | Whole program path profiling |
US6762699B1 (en) * | 1999-12-17 | 2004-07-13 | The Directv Group, Inc. | Method for lossless data compression using greedy sequential grammar transform and sequential encoding |
US6400289B1 (en) * | 2000-03-01 | 2002-06-04 | Hughes Electronics Corporation | System and method for performing lossless data compression and decompression |
JP4693292B2 (ja) * | 2000-09-11 | 2011-06-01 | 株式会社東芝 | 強磁性トンネル接合素子およびその製造方法 |
US8868544B2 (en) * | 2002-04-26 | 2014-10-21 | Oracle International Corporation | Using relational structures to create and support a cube within a relational database system |
US6801141B2 (en) * | 2002-07-12 | 2004-10-05 | Slipstream Data, Inc. | Method for lossless data compression using greedy sequential context-dependent grammar transform |
US20050273274A1 (en) * | 2004-06-02 | 2005-12-08 | Evans Scott C | Method for identifying sub-sequences of interest in a sequence |
US20060117307A1 (en) * | 2004-11-24 | 2006-06-01 | Ramot At Tel-Aviv University Ltd. | XML parser |
US7840774B2 (en) * | 2005-09-09 | 2010-11-23 | International Business Machines Corporation | Compressibility checking avoidance |
US7447865B2 (en) * | 2005-09-13 | 2008-11-04 | Yahoo ! Inc. | System and method for compression in a distributed column chunk data store |
US7403951B2 (en) * | 2005-10-07 | 2008-07-22 | Nokia Corporation | System and method for measuring SVG document similarity |
US7464247B2 (en) * | 2005-12-19 | 2008-12-09 | Yahoo! Inc. | System and method for updating data in a distributed column chunk data store |
US7921087B2 (en) * | 2005-12-19 | 2011-04-05 | Yahoo! Inc. | Method for query processing of column chunks in a distributed column chunk data store |
2006
- 2006-10-07: DE application DE102006047465A, published as DE102006047465A1 (de); status: not active, withdrawn
2007
- 2007-07-24: US application US12/444,434, published as US20100312755A1 (en); status: not active, abandoned
- 2007-07-24: EP application EP07785674A, published as EP2076964A1 (fr); status: not active, ceased
- 2007-07-24: WO application PCT/DE2007/001311, published as WO2008040267A1 (fr); status: active, application filing
Non-Patent Citations (1)
Title |
---|
See references of WO2008040267A1 * |
Also Published As
Publication number | Publication date |
---|---|
DE102006047465A1 (de) | 2008-04-10 |
US20100312755A1 (en) | 2010-12-09 |
WO2008040267A1 (fr) | 2008-04-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE10301362B4 (de) | Block data compression system consisting of a compression device and a decompression device, and method for fast block data compression with multi-byte search | |
DE60000912T2 (de) | Method and apparatus for data compression of network data packets using per-packet hash tables | |
DE60001210T2 (de) | Method and apparatus for data compression of network data packets | |
DE69704362T2 (de) | Data compression/decompression system using immediate string search with interleaved dictionary update | |
DE68925798T2 (de) | Data compression | |
DE69905343T2 (de) | Block-wise adaptive statistical data compressor | |
DE10196890B4 (de) | Method for performing Huffman decoding | |
DE19622045C2 (de) | Data compression and data decompression scheme using a search tree in which each entry is stored with a character string of infinite length | |
DE69528152T2 (de) | Data compression apparatus, data expansion apparatus, and system for data compression and expansion | |
DE4340591C2 (de) | Data compression method using small dictionaries for application to network packets | |
DE69026924T2 (de) | Data compression method | |
DE69318446T2 (de) | Method and apparatus for data compression and decompression for a transmission arrangement | |
DE69027606T2 (de) | Data compression apparatus | |
DE69522497T2 (de) | System and method for data compression | |
DE69833094T2 (de) | Method and apparatus for adaptive data compression with a higher degree of compression | |
DE60000380T2 (de) | Method and apparatus for data compression | |
EP2296282B1 (fr) | Method and arrangement for arithmetic encoding and decoding using multiple look-up tables | |
DE69834695T2 (de) | Method and apparatus for data compression | |
DE2801988A1 (de) | Arithmetic coding of symbol sequences | |
DE3485824T2 (de) | Method for data compression | |
DE69025160T2 (de) | Method for decoding compressed data | |
DE69021854T2 (de) | Method for decompressing compressed data | |
EP2076964A1 (fr) | Method and device for the compression and decompression of digital data by electronic means using a context grammar | |
DE68927939T2 (de) | Monadic coding of the start-step-stop type for data compression | |
DE19653133C2 (de) | System and method for pre-entropic coding |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20090507 |
AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR |
RIN1 | Information on inventor provided before grant (corrected) | Inventor name: HILDEBRANDT, ERIC; Inventor name: BOKLER, MARTIN |
17Q | First examination report despatched | Effective date: 20090924 |
REG | Reference to a national code | Ref country code: DE; Ref legal event code: R003 |
DAX | Request for extension of the European patent (deleted) | |
STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED |
18R | Application refused | Effective date: 20120707 |