EP2076964A1

EP2076964A1 - Method and apparatus for compressing and decompressing digital data by electronic means using a context grammar

Info

Publication number: EP2076964A1
Application number: EP07785674A
Authority: EP
Inventors: Eric Hildebrandt; Martin Bokler
Original assignee: Deutsche Telekom AG
Current assignee: Deutsche Telekom AG
Priority date: 2006-10-07
Filing date: 2007-07-24
Publication date: 2009-07-08
Also published as: DE102006047465A1; US20100312755A1; WO2008040267A1

Abstract

The invention relates to a method for compressing and decompressing digital data by electronic means using a context grammar, characterized by the steps of: grammatically compressing first digital data by means of searching for repeated sequences of terminal symbols (V_T) which cannot be broken down any further in first digital data to be compressed; replacing the identified repeated sequences of terminal symbols (V_T) which cannot be broken down any further with non-terminal symbols (V_N) which can be broken down further; storing the digital data belonging to these non-terminal symbols (V_N) in an associated context grammar; and carrying out context compression, with which second digital data are compressed using this context grammar which has been produced from the first digital data.

Description

DESCRIPTION

Method and device for compression and decompression of digital data by electronic means using a context grammar

The invention relates to a method and apparatus for compression and decompression of digital data by electronic means using a context grammar, and more particularly to a method and system for highly efficient and fast lossless compression of short, redundant data sets.

The compression of digital data by electronic means, i.e. in an electronic system for information processing or data transmission, is used primarily to save storage space and transmission capacity. In particular, when larger amounts of digital data are transmitted over data networks, compression is important not only for efficiently utilizing existing transmission capacity, such as available bandwidth, but also for speeding up transmission. Even when storing large volumes of digital data in the gigabyte or terabyte range, such as databases, efficient compression is often required to reduce the storage space needed compared with the uncompressed data and thereby conserve technical resources.

For the lossless compression of data (data compression), the algorithms of Huffman and of Ziv and Lempel (LZ) are often used. Widely used examples are the algorithms LZ77 and LZ78, named after their year of publication and described in the articles "A Universal Algorithm for Sequential Data Compression", J. Ziv, A. Lempel, IEEE Transactions on Information Theory 23 (1977), pp. 337-343, and "Compression of Individual Sequences via Variable-Rate Coding", J. Ziv, A. Lempel, IEEE Transactions on Information Theory 24 (1978), pp. 530-536. The Huffman algorithm is described in the article "A Method for the Construction of Minimum-Redundancy Codes", D. A. Huffman, Proceedings of the Institute of Radio Engineers, Sept. 1952, Vol. 40, No. 9, pp. 1098-1101.

In the LZ77 algorithm, identical symbol sequences within the symbol sequence to be compressed are not stored multiple times; instead, a reference to an earlier occurrence of the symbol sequence is stored. The reference indicates how many symbols one has to go back in the sequence and how long the sequence to be repeated is. The LZ78 algorithm creates a table of frequently occurring symbol sequences. If such a symbol sequence occurs in the symbol sequence to be compressed, only the corresponding code from the table, which is shorter than the symbol sequence itself, has to be inserted.
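The back-reference idea of LZ77 can be illustrated with a minimal decoding sketch. The token format used here (a literal character, or a pair of distance and length) and the function name `lz77_decode` are assumptions made for illustration; real LZ77 implementations use a bounded window and a binary token encoding.

```python
def lz77_decode(tokens):
    """Decode a list of illustrative LZ77-style tokens.

    A token is either a one-character string (literal) or a tuple
    (distance, length): go back `distance` symbols in the output
    produced so far and repeat `length` symbols from there.
    """
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.append(token)
        else:
            distance, length = token
            start = len(out) - distance
            # Copy one symbol at a time so that a reference may overlap
            # the symbols it is currently producing.
            for i in range(length):
                out.append(out[start + i])
    return "".join(out)


# "AB" followed by "go back 2 symbols and repeat 2 symbols".
decoded = lz77_decode(["A", "B", (2, 2)])
```

Because a reference can only reach back as far as the window allows, repetitions that lie further apart cannot be linked in this scheme.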

A further development of the LZ78 algorithm is the LZW algorithm described in the article "A Technique for High-Performance Data Compression", T. A. Welch, IEEE Computer, Vol. 17, No. 6 (1984), pp. 8-19. The LZW algorithm, like the LZ78 algorithm, is a table-based compression method. It starts from a predetermined table of 256 entries, which is extended in the course of the compression process according to the requirements of the symbol sequence to be compressed. If a symbol sequence already present in the table appears in the symbol sequence to be compressed, the table index can be stored in its place. The LZW algorithm is used, for example, for data compression in modems and, in computer systems, when storing GIF and TIFF files. U.S. Patent No. 4,558,302 describes the LZW algorithm in detail.
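The table-building behaviour described above can be sketched as follows. This is a simplified illustration, not the procedure of the cited patent: the function name `lzw_compress` is an assumption, the table is allowed to grow without limit, and the output is a list of integer indices rather than a packed bit stream.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Return the list of table indices for `data` (simplified LZW)."""
    # Predetermined table: one entry for each of the 256 byte values.
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            # Keep extending the match while it is still in the table.
            current = candidate
        else:
            # Emit the index of the longest known sequence and
            # extend the table with the new, longer sequence.
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


codes = lzw_compress(b"ABABAB")
```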

The aforementioned algorithms are all window-based compression methods in which, due to limited resources such as memory constraints, a so-called window of predetermined width is moved over the data to be compressed and the data lying within the window are compressed. The windows used in these algorithms can be initialized, so that sequences of the data to be compressed that occur in this initialization can be referenced directly on their first occurrence, thereby achieving compression.

The window-based methods are disadvantageous in that only text passages whose distance from one another is smaller than the window width can be linked to one another.

For grammatical compression of digital data, the following algorithms are also known:

Sequitur: described in "Identifying hierarchical structure in sequences: A linear-time algorithm", C. Nevill-Manning, I. Witten, Journal of Artificial Intelligence Research, 7: 67-82, 1997; and

Repair: described in "Off-line dictionary-based compression", N. J. Larsson, A. Moffat, Proceedings of the IEEE, vol. 88, no. 11, pp. 1722-1732.

The object of the invention is to propose an improved method and device for the compression and decompression of digital data by electronic means with which short, redundant data can be compressed and decompressed efficiently and quickly. This object is achieved by a method for compression and decompression of digital data by electronic means using a context grammar with the features of claim 1, a computer program having the features of claim 11, a computer program product having the features of claim 12, and an apparatus having the features of claim 13. The invention further relates to various uses of the method according to the invention as specified in claims 14, 16 and 18.

Preferred embodiments of the invention will become apparent from the dependent claims.

Thus, the object is achieved by a method for compression and decompression of digital data by electronic means using a context grammar, characterized by the steps of: grammatically compressing first digital data by searching for repeated sequences of terminal symbols (V_T), which cannot be decomposed further, in the first digital data to be compressed; replacing the repeated sequences found with nonterminal symbols (V_N), which can be decomposed further; storing the digital data belonging to these nonterminal symbols (V_N) in an associated context grammar; and performing context compression, in which second digital data are compressed using this context grammar generated from the first digital data.

In this case, the step of generating a grammar preferably takes place in such a way that, as a derivation, each symbol from the set of nonterminal symbols (V_N) is mapped onto symbols from the set of nonterminal symbols (V_N) combined with the set of terminal symbols (V_T).

More preferably, a step of generating a start symbol (S_0) whose derivation corresponds to a text to be compressed is executed.

It may be advantageous here if the second digital data is similar to the first digital data.

Preferably, when the rules of the generated grammar are read in, expansions of these rules are stored in a tree structure, which can be extended with new rules obtained from the second digital data.

For context compression, the tree structure is preferably traversed symbol by symbol in ascending order, searching for the grammar rule corresponding to the longest prefix for which a tree path exists starting from its root.

It may be advantageous if, in the context compression, a search is made for the most frequently occurring grammar rules or the grammar rules with the longest derivation.

To generate the grammar, algorithms according to Sequitur, Sequential or Repair are preferably used.

It may also be advantageous if the generated grammar is additionally coded arithmetically or using a Huffman code.

A computer program for compressing and decompressing digital data by electronic means using a context grammar according to the above embodiments achieves the object of the invention when it runs on a data processing system such as a computer.

Such a computer program is preferably designed as a computer program product and comprises a machine-readable data carrier on which the computer program is stored in the form of electronically or optically readable control signals for a computer.

An apparatus for compression and decompression of digital data by electronic means using a context grammar, with an input device, a processing device, a memory device and an output device for carrying out the aforementioned method is used for practicing the method according to the invention.

The inventive method for compression and decompression of digital data by electronic means using a context grammar is particularly efficient in compressing data records of databases, in particular relational, object-oriented and XML-based databases. For example, a context grammar may be created for a table column, and this context grammar may then be used to compress the column entries.

Furthermore, the inventive method for compression and decompression of digital data by electronic means using a context grammar can be used for compressing a data transmission, in particular over a point-to-point connection. This can increase the effective bandwidth of a data connection. The relatively short data packets that often occur in data transmissions are well suited to context compression. In particular, packet structures of digital data to be transmitted can be compressed prior to data transmission using a context grammar present at both transmission points.

Finally, the inventive method for compression and decompression of digital data by electronic means using a context grammar can also be used advantageously for compressing a file or multiple files of the same type, in particular XML files.

An essential idea underlying embodiments of the invention is thus that, upon compression of first data, information is gained which can be used to efficiently compress second data similar to the first data. In other words, the information obtained from the first data can be used efficiently.

More specifically, in the compression of the first data, a context grammar is generated, which is then usable for compression of the second and further data. In other words, in the compression of the first data, information is obtained, which is then used to compress second data.

The grammar generated in the compression of the second data contains, in particular, a special rule, referred to below as the start rule, whose expansion corresponds to the data to be compressed. While this start rule is generally characteristic of the particular data set being compressed, the other rules, which are "inserted" into the start rule according to the context grammar, are more general in nature. The information obtained from similar data is thus used as the basis for generating the grammar that is applied to the additional data currently being compressed. For further improved compression, the symbols of the grammar can then be coded, for example by means of Huffman codes or arithmetic coding.

The invention is characterized by the following points:

1. The use of a compression method according to the invention, which works in a fundamentally different way and is based on a grammar, has the significant advantage that rules can be used regardless of their position in the grammar and in the data. In the window-based methods, by contrast, as mentioned above, only text passages whose distance is smaller than the window width can be linked together. This is very unfavorable, especially for large amounts of similar data sets such as occur in the columns of databases.

2. According to the invention, the amount of information to be used for the context grammar can be chosen flexibly in a very simple way, for example depending on the application, the data type and the amount of data.

3. According to the invention, the context information can be extracted directly from similar data by first compressing these data; the grammar created for them in the process, without its start rule, is then used as the context grammar for other data. This happens simultaneously and without additional effort and is therefore extremely efficient.

4. The invention allows more flexible coding options, since the code of a grammar newly created for other data can be created and used independently of the code of the context grammar for the previously compressed data. This provides additional advantageous options for further optimization.

The greatest advantage of the method and the device according to the invention thus lies in particular in an efficient compression of small or short data sets which are not or substantially less efficiently compressible with the known compression methods. This results in significant advantages for the storage, transmission and processing of data for applications for such data sets.

Further advantages and applications of the present invention will become apparent from the following description of embodiments.

First, the compression of data will be described by generating a context-free grammar according to the invention.

Let V_T be the alphabet used in the data to be compressed, for example the set of 256 possible character values or symbols, such as those of the extended ASCII code, which can each be coded with one byte. The elements of V_T are called terminals and denote those symbols that cannot be decomposed further.

The grammar to be generated for compression is then described by a set V_N of nonterminal symbols, i.e. variables, a special start rule S_0 and derivation rules S_1 to S_n. The derivation rules S_1 to S_n each contain a nonterminal symbol on the left side and at least 2 symbols of V_T combined with V_N on the right side. A short example should clarify this. Suppose the text ABAB is to be compressed, where A and B are elements of V_T, i.e. terminals that cannot be broken down further. Now a rule S_1 is generated:

S_1 -> AB

This yields the start rule for the compressed text

S_0 -> S_1 S_1

and the grammar S_1 -> AB, which in this example contains only the rule mapping S_1 to AB.
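Decompression of such a grammar amounts to expanding the start rule until only terminals remain. The following minimal sketch illustrates this for the ABAB example; the dictionary layout and the function name `expand` are assumptions chosen for illustration and are not taken from the patent.

```python
def expand(symbol, rules):
    """Recursively expand `symbol` into a string of terminals.

    `rules` maps each nonterminal to the list of symbols on the
    right side of its rule. Any symbol without a rule is a terminal.
    """
    if symbol not in rules:
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])


# Grammar for the text ABAB: S_0 -> S_1 S_1 and S_1 -> AB.
rules = {"S0": ["S1", "S1"], "S1": ["A", "B"]}
text = expand("S0", rules)
```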

The context-free grammar to be generated for the data to be compressed can also be obtained by means of so-called context compression. In context compression, a plurality of (base) rules K_1 to K_n are either predefined or taken from a previously created grammar, and these can then be referenced when generating a new, context-free grammar from the data currently being compressed. The rules K_1 to K_n of the context grammar can thus be used both for creating new rules and in the start rule S_0.

After compression by means of the context-free grammar, this first compression is further improved by storing the grammar with a code in which frequent symbols are assigned shorter codewords than rare symbols. For example, a Huffman code can be used for this purpose.
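How frequent grammar symbols end up with shorter codewords can be sketched with a standard Huffman construction. The function name `huffman_code` and the example symbol list are assumptions for illustration; the patent does not prescribe a particular implementation.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each distinct symbol to a bit string."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie-breaker, partial code table).
    # The unique tie-breaker ensures the dicts are never compared.
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codewords.
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical symbol stream of a stored grammar: S1 is most frequent.
stream = ["S1"] * 5 + ["S2"] * 2 + ["A", "B"]
code = huffman_code(stream)
```

In this example the most frequent symbol receives the shortest codeword and the two rarest symbols the longest ones.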

In the case of context compression, there are also various possibilities for coding, in particular for coding the rules of the context grammar.

1. A first possibility is that the codewords of the context grammar continue to be used. In this case, the entire context grammar is stored in coded form such that the codeword lengths used reflect the frequencies of occurrence of the corresponding expanded rules. Assuming that the data to be compressed are of the same type as, i.e. similar to, the data used to generate the context grammar, the frequencies in the data to be compressed behave similarly to the frequencies observed when the context grammar was generated. It is therefore advantageous to continue to use the codewords from the context grammar for coding the context rules.

If additional new rules are generated, codewords are needed for these rules that have not yet been used in coding the context grammar. There are various possibilities for this:

a) According to one possibility in connection with the aforementioned first possibility, two codes are used in parallel, i.e., in addition to the codewords that continue to be used, a separate code is generated for the newly generated, record-specific rules. To store the compressed data, both the retained codewords from the context grammar and codewords from this newly generated code are then used.

It can be determined in various ways to which code the next codeword belongs: i) for example, one of the two codes contains otherwise unused codewords that are used to identify one or more codewords of the other code, or ii) each of the two codes contains an otherwise unused codeword that is used to switch to the other code.

b) According to another possibility in connection with the above first possibility, the code for the context grammar contains unused placeholder codewords that can be used for newly created rules.

2. According to a second possibility, a common code is generated both for the rules of the context grammar that continue to be used and for the newly created rules. For this purpose, it must be possible to assign a new codeword to each context rule used. This can be done, for example, by specifying the corresponding new codeword alongside the codeword associated with the context grammar rule.

The establishment of the association with the new code word is not limited to the above modes, but may be appropriately selected according to the characteristics of the data to be compressed in order to achieve the best possible compression.
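The variant in which each of two parallel codes reserves one codeword for switching to the other code can be sketched as follows. All names here (`SWITCH`, `encode_with_two_codes`, `decode_with_two_codes`) and the two small code tables are hypothetical. The sketch assumes that both codes are prefix-free, that every symbol occurs in exactly one of them, and that coding starts in the context code.

```python
SWITCH = "<switch>"  # hypothetical reserved symbol present in both codes


def encode_with_two_codes(symbols, context_code, local_code):
    """Encode symbols, switching codes whenever the active one lacks a symbol."""
    codes = (context_code, local_code)
    active = 0  # start in the context code
    bits = []
    for s in symbols:
        if s not in codes[active]:
            bits.append(codes[active][SWITCH])
            active = 1 - active
        bits.append(codes[active][s])
    return "".join(bits)


def decode_with_two_codes(bits, context_code, local_code):
    """Invert encode_with_two_codes for prefix-free code tables."""
    tables = [{w: s for s, w in c.items()} for c in (context_code, local_code)]
    active, word, out = 0, "", []
    for b in bits:
        word += b
        sym = tables[active].get(word)
        if sym is None:
            continue  # codeword not complete yet
        if sym == SWITCH:
            active = 1 - active
        else:
            out.append(sym)
        word = ""
    return out


context_code = {"K1": "0", "K2": "10", SWITCH: "11"}   # retained context rules
local_code = {"S1": "0", "x": "10", SWITCH: "11"}      # record-specific rules
message = ["K1", "S1", "S1", "K2"]
bits = encode_with_two_codes(message, context_code, local_code)
```

Switching costs one extra codeword each time, so this variant pays off mainly when symbols of the same code tend to occur in runs.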

The process according to the invention will be described in more detail below. Based on the idea underlying the invention that information obtained in the compression of first digital data is used to compress second, similar digital data, the first digital data is first of all grammatically compressed.

Let V_T be the set of symbols used in the first digital data. During compression, these data, for example a text, are searched for repeated sequences of terminal symbols V_T, i.e. symbols or characters that cannot be decomposed further. The sequences found are then replaced by a nonterminal symbol, i.e. a symbol that can be decomposed further according to rules, and the partial data sequence belonging to this symbol, for example a subtext, is stored in a grammar containing the rules. This results in a set of nonterminal symbols V_N.

In other words, for each symbol A from the set V_N, the resulting grammar indicates the symbols from V_N combined with V_T onto which A is mapped. This is also referred to as the derivation of (symbol) A.

In particular, according to this method, there is a special symbol S_0 (start rule) whose derivation corresponds to the data sequence to be compressed. For example, the text "a rose is a rose is a rose" may be represented by the following grammar:

A -> a rose

B -> is A

S_0 -> ABB
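One way to obtain a grammar of this kind automatically is to replace the most frequent pair of adjacent symbols repeatedly, in the spirit of the Repair algorithm. The sketch below is a deliberately simplified illustration under that assumption, not the procedure claimed in the patent. It produces rules with exactly two symbols on the right side, so its output differs from the hand-written grammar above. The names `build_grammar` and `expand` and the rule labels of the form `<R1>` are hypothetical.

```python
from collections import Counter


def build_grammar(text):
    """Return (start_sequence, rules) using simplified pair replacement."""
    seq = list(text)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats any more
        name = f"<R{len(rules) + 1}>"  # new nonterminal symbol
        rules[name] = list(pair)
        # Replace non-overlapping occurrences of the pair, left to right.
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                new_seq.append(name)
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return seq, rules


def expand(symbols, rules):
    """Expand a sequence of symbols back into the original text."""
    return "".join(
        expand(rules[s], rules) if s in rules else s for s in symbols
    )


text = "a rose is a rose is a rose"
start, rules = build_grammar(text)
restored = expand(start, rules)
```

The rules without the start sequence correspond to what the description calls the context grammar, which can be kept and reused for similar data.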

Then a context compression is performed. In context compression, similar second digital data are compressed with the predetermined grammar generated from the first digital data. If the grammar generated from the first digital data has already been stored elsewhere, the amount of data to be stored for the compressed second digital data is advantageously reduced.

For example, if the first digital data have been compressed and stored, and second digital data similar to the first digital data are to be compressed and stored, the grammar generated for the first digital data will already contain a plurality of rules that can be applied to the second digital data. In this way, the second digital data can be compressed immediately.

The generation of the grammar can be done in various ways, for example according to the methods Sequential, Sequitur or Repair. Using Sequential as an example, the following describes how a grammar can be used efficiently as a context grammar and read in such a way that it can be applied with little computational effort.

When reading in the grammar rules, expansions of these rules are preferably stored in a tree. A node of such a tree corresponds to a data string, and the branches leaving such a node correspond to the continuations of that data string according to the grammar rules, wherein, in the case of text characters for example, any two branches differ in their first letter.

Such a tree can be extended with new grammar rules by inserting, starting from the root of the tree, the data string corresponding to an expanded grammar rule into the tree. Once all rules of the grammar have been inserted into the tree, the tree can be used for context compression.

In one example, the underlying text is traversed from front to back, with the aim of finding the grammar rule that corresponds to the longest possible prefix of the text. In other words, the search is for the longest prefix of the text for which there is a path within the tree starting from its root. This is efficiently possible because at each node there is at most one matching branch for each letter.

The nodes of such a path may correspond to complete grammar rules or only to part of a rule. In this context, the longest prefix corresponds to the last node of the path that corresponds to a rule. This rule can then be applied, and the underlying algorithm continues after the data string that matches the rule. If no rule is found, the first terminal symbol of the text to be compressed is used and the algorithm is applied to the text following it.
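The tree construction and the longest-prefix search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the node layout (nested dictionaries), the function names `build_tree` and `context_compress`, and the rule names `K1` and `K2` are hypothetical, rule expansions are assumed to be non-empty, and the tree has one branch per character.

```python
def build_tree(expanded_rules):
    """Insert each fully expanded rule into a character tree.

    `expanded_rules` maps a rule name to its expansion as a string
    of terminals. A node records the rule that ends at it, if any.
    """
    root = {"children": {}, "rule": None}
    for name, expansion in expanded_rules.items():
        node = root
        for ch in expansion:
            node = node["children"].setdefault(
                ch, {"children": {}, "rule": None}
            )
        node["rule"] = name
    return root


def context_compress(text, root):
    """Greedily replace the longest matching rule expansion at each position."""
    out = []
    pos = 0
    while pos < len(text):
        node = root
        best_rule, best_len = None, 0
        i = pos
        # Follow the unique branch for each character while one exists.
        while i < len(text) and text[i] in node["children"]:
            node = node["children"][text[i]]
            i += 1
            if node["rule"] is not None:
                # Remember the last node on the path that is a full rule.
                best_rule, best_len = node["rule"], i - pos
        if best_rule is None:
            out.append(text[pos])  # no rule applies: emit the terminal
            pos += 1
        else:
            out.append(best_rule)
            pos += best_len
    return out


tree = build_tree({"K1": "a rose", "K2": " is a rose"})
compressed = context_compress("a rose is a rose", tree)
```

New rules obtained from later data can be added by calling the same insertion loop on the existing tree, which corresponds to the extension of the tree described above.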

An alternative way of context compression, in contrast to the above, is to look for the most common rules, which in some cases can further reduce the storage space required for the resulting compressed file.

Hereinafter, effects and advantages resulting from the applications described above will be described by way of example. In databases, for example, entries are predominantly relatively short and highly redundant over an entire column of a database table. In this case, creating a context grammar for such a column and compressing the column with this context grammar can achieve very good compression.

In contrast to known database compression methods, compression can thus be carried out globally over the column. In addition, parts of column entries are advantageously compressed as well, in contrast to known table compression methods, which compress only entire entries. With a suitably recursive grammar, in which symbols refer to other symbols until the terminals are finally reached, outstanding compression can be achieved.

Another class of compression methods compresses the column entries individually. However, for the short database entries considered here, these achieve at most low compression.

The compression methods used in known databases such as Oracle or IBM DB2 are fundamentally different. The compression method used in Oracle works locally on memory pages, so a few rows of the table are always compressed at once. With the method according to the invention, by contrast, the entries of an entire column are compressed. The compression used in IBM DB2 uses a global dictionary with a codeword length of 12 bits. The advantages of context compression according to the method of the invention, on the other hand, are the variable codeword length and the possibility of compressing partial strings as well. It is true that with Oracle and other databases individual database entries can also be compressed, for example with LZ77. However, this is worthwhile only for longer entries that are redundant in themselves. In the field of application of context compression (columns with short entries, where the entries of a column contain redundant parts), this type of compression cannot be used profitably.

Another application of the context compression described above is the compression of point-to-point connections in data transfers to increase the effective bandwidth of such connections. Relatively short data packets, such as frequently occur during data transmission, are particularly suitable for context compression. In contrast to the known standard methods, which can only use the relatively low redundancy in a packet, context compression allows highly efficient compression of typical packet structures.

Moreover, the packets refer to, for example, one or two context grammars already present at the two endpoints of a point-to-point connection, one for each direction of transmission, so that only the rules contained in the context grammars are referenced. This differs drastically from the conventional methods, in which all the necessary information must be contained in the respective packet, which further degrades the compression.

The proposed context compression may also be designed to be adaptive, such that rules within context grammars can be renewed synchronously at sender and receiver.

In the field of data storage, too, context compression using a context grammar can be applied advantageously to small files that can each be compressed only slightly on their own, for example when storing many small files of the same type. Examples include XML-formatted order forms and other records of similar structure.

Claims

PATENT CLAIMS

1. A method of compressing and decompressing digital data by electronic means using a context grammar, characterized by the steps of:

- grammatically compressing first digital data by searching for repeated sequences of terminal symbols (V_T), which cannot be decomposed further, in the first digital data to be compressed;

- replacing the repeated sequences of terminal symbols (V_T) found with nonterminal symbols (V_N), which can be decomposed further;

- storing the digital data associated with these nonterminal symbols (V_N) in an associated context grammar; and

- performing context compression, in which second digital data are compressed using this context grammar generated from the first digital data.

2. A method according to claim 1, characterized by the step of generating a grammar such that, as a derivation, each symbol from the set of nonterminal symbols (V_N) is mapped onto symbols from the set of nonterminal symbols (V_N) combined with the set of terminal symbols (V_T).

3. The method according to claim 2, characterized by the step of generating a start symbol (S_0) whose derivation corresponds to a text to be compressed.

4. The method according to any one of the preceding claims, characterized in that, when the rules of the generated grammar are read in, expansions of these rules are stored in a tree structure.

5. The method according to claim 4, characterized in that the tree structure is expandable with new rules obtained from the second digital data.

6. The method according to claim 5, characterized in that, for context compression, the tree structure is traversed symbol by symbol in search of a grammar rule corresponding to a longest prefix for which a tree path exists starting from its root.

7. The method according to claim 5, characterized in that, in the context compression, a search is made for the most frequently occurring grammar rules or the grammar rules with the longest derivation.

8. The method according to any one of the preceding claims, characterized in that algorithms according to Sequential, Sequitur or Repair are used to generate the grammar.

9. The method according to any one of the preceding claims, characterized in that the generated grammar is additionally arithmetically coded.

10. The method according to any one of claims 1 to 8, characterized in that the coding is carried out using a Huffman code.

11. Computer program for compression and decompression of digital data by electronic means using a context grammar according to a method of claims 1 to 10 when it is running on a data processing system such as a computer.

12. Computer program product comprising a machine-readable data carrier on which a computer program according to claim 11 is stored in the form of electronically or optically readable control signals for a computer.

13. A device for compression and decompression of digital data by electronic means using a context grammar, with an input device, a processing device, a memory device and an output device for carrying out the method according to one of claims 1 to 10.

14. Use of the method for compression and decompression of digital data by electronic means using a context grammar according to one of claims 1 to 10 for compressing data records of databases, in particular relational, object-oriented and XML-based databases.

15. Use according to claim 14, characterized in that a context grammar is created for a table column and this context grammar is used to compress the column entries.

16. Use of the method for compression and decompression of digital data by electronic means using a context grammar according to one of claims 1 to 10 for compressing a data transmission, in particular a point-to-point connection.

17. Use according to claim 16, characterized in that packet structures of digital data to be transmitted are compressed before the data transmission using a context grammar present at both transmission points.

18. Use of the method for compression and decompression of digital data by electronic means using a context grammar according to one of claims 1 to 10 for the compression of one or more files of the same type, in particular XML files.