DE102006028469B4 - Adaptive Quantized Coding Method

Info

Publication number: DE102006028469B4
Application number: DE200610028469
Authority: DE
Inventors: Iakov Nekritch; Marek Karpinski
Original assignee: Karpinski Marek Prof Dr
Current assignee: Karpinski Marek Prof Dr
Priority date: 2006-06-21
Filing date: 2006-06-21
Publication date: 2010-12-02
Anticipated expiration: 2026-06-22
Also published as: DE102006028469A1

Abstract

A method for the adaptive prefix-free coding of a data stream over an alphabet of n elements with two parameters p and q, where p is greater than q, characterized in that the codeword lengths are determined from the length of the already coded part of the text divided by the parameter q and rounded up (hereinafter the quantized text length) and from the frequency of the symbol in the already coded part of the text divided by the parameter q and rounded down (hereinafter the quantized frequency), the method comprising:
- a list R of already coded symbols
- a method A for updating the length of the coded symbol
- a method B that modifies the code after the length of a symbol has been changed.

Description

Our invention is an adaptive method for prefix-free coding. Prefix-free coding is one of the most widely used compression methods and is applied in many compression programs and file formats (e.g. gzip, JPEG, PNG).

The problem of static prefix-free coding can be formulated as follows: given a data stream S = s₁...s_m over the alphabet A = {a₁, a₂, ..., a_n}, each symbol a_i ∈ A is assigned a codeword c_i such that the total coded length is as small as possible. A distinction is made between static and adaptive prefix-free coding. In static coding the data stream is read twice. The first pass determines the symbol frequencies; once they are known, a code is generated. The second pass encodes the symbols. An optimal static prefix-free code can be generated with the classic Huffman algorithm. Because static methods need the frequencies of all symbols to generate the code, the text S must be read twice. In addition, a description of the code must be stored with the encoded data stream.
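The two-pass static scheme can be illustrated with a minimal Python sketch. It is not part of the patent: the function names `huffman_code_lengths` and `encoded_bits` are hypothetical, and the sketch only computes codeword lengths with the classic Huffman merge, which is enough to obtain the total coded length.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: codeword length} of an optimal static prefix-free code."""
    freq = Counter(text)                  # first pass: symbol frequencies
    if len(freq) == 1:                    # a lone symbol still needs one bit
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie breaker, symbols in this subtree).
    heap = [(w, k, [s]) for k, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # merged symbols move one level deeper
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths

def encoded_bits(text):
    """Second pass: total number of bits when every symbol uses its codeword."""
    lengths = huffman_code_lengths(text)
    return sum(lengths[s] for s in text)

print(huffman_code_lengths("aaaabbc"))    # {'a': 1, 'b': 2, 'c': 2}
print(encoded_bits("aaaabbc"))            # 10
```

The sketch ignores the cost of storing the code itself, which the static scheme also has to pay.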

In adaptive coding, the (optimal) code for the data stream s₁s₂...s_{i-1} is used to encode the symbol s_i. The advantage of adaptive methods is that the data is read only once. Algorithms for adaptive Huffman coding were developed by, among others, Faller, Gallager and Knuth, and by Vitter. The running time of these algorithms is proportional to M, where M is the number of bits in the encoded data stream.

This patent describes a method whose running time is linear in m, where m is the number of symbols in the data stream S. The method also guarantees a good upper bound on the length of the encoded data stream.

The following section describes the general structure of adaptive compression methods. Section 3 describes canonical codes. Our invention is explained in Sections 4 and 5.

1 Adaptive methods

We denote by m the length of the data stream S = s₁s₂...s_m to be encoded, and by M the coded length, i.e. the number of bits needed to encode the data stream S. We denote by occ(a, i) the frequency of the symbol a in s₁s₂...s_i, and by occ(a) = occ(a, m) the frequency of the symbol a in S. By (NYT) we denote a special symbol that differs from all symbols in S. An adaptive coding method consists of the following steps:

1. We generate a prefix-free code for (NYT)s₁s₂...s_{i-1}. This code is usually obtained by modifying the code for (NYT)s₁s₂...s_{i-2}. For this purpose one can use, for example, the algorithm of Faller, Gallager and Knuth (see [1]-[3]) or the method described here.
2. If the symbol s_i has already occurred at least once in s₁s₂...s_{i-1}, the codeword for s_i is used to encode the symbol s_i.
3. If s_i occurs for the first time in s₁...s_i, the codeword for the special symbol (NYT) is output, followed by a binary representation of the index of s_i. The symbol (NYT) indicates that the next symbol is a "new" symbol.
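The three steps can be sketched as a small Python loop. This is an illustration under assumptions, not the patented procedure: the callbacks `codeword_for` and `update_code` stand in for step 1 and are hypothetical, the marker string `NYT` is an arbitrary choice, and the very first symbol is written as its bare index because the code is still empty at that point (as claims 8 and 10 state).

```python
NYT = "<NYT>"  # hypothetical marker for the "not yet transmitted" symbol

def adaptive_encode(stream, alphabet, codeword_for, update_code):
    """Encode `stream` symbol by symbol.

    codeword_for(symbol) -> bit string under the current code (step 1 result)
    update_code(symbol)  -> adapts the code after the symbol was coded
    """
    # Fixed-width binary index of a symbol within the alphabet.
    index_bits = max(1, (len(alphabet) - 1).bit_length())
    seen = set()
    out = []
    for s in stream:
        if s in seen:                       # step 2: known symbol
            out.append(codeword_for(s))
        else:                               # step 3: new symbol
            if seen:                        # no NYT codeword exists before the first symbol
                out.append(codeword_for(NYT))
            out.append(format(alphabet.index(s), f"0{index_bits}b"))
            seen.add(s)
        update_code(s)                      # step 1 for the next symbol
    return "".join(out)

# Toy usage with a fixed code and a no-op update, only to show the control flow.
toy = {NYT: "0", "a": "10", "b": "11"}
print(adaptive_encode("abab", "ab", toy.__getitem__, lambda s: None))
```

A real encoder would plug a code-maintenance routine into `update_code`; the fixed toy code here never adapts.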

Steps 2 and 3 are easy to implement. Our invention is mainly concerned with step 1.

2 Previous methods

This section reviews other adaptive coding methods developed before our work. The so-called FGK algorithm was developed independently by Faller and Gallager and substantially improved by Knuth; it is described in [1], [2] and [3]. The FGK algorithm maintains the optimal code for the already encoded part of the data stream, (NYT)s₁...s_{i-1}. The method of Vitter [5] is likewise based on maintaining the optimal code for (NYT)s₁...s_{i-1}. By using a special optimal code, the method of Vitter [5] achieves a better upper bound on the length of the encoded data stream. Both algorithms determine the codeword length of the next symbol to be coded from the frequencies of all symbols in the already encoded part of the text.

A further relevant method was developed by Turpin and Moffat and is described in [4]. It determines the optimal code from approximated symbol frequencies. Since the method of Turpin and Moffat also uses an optimal code, the codeword length of the next symbol to be coded is determined from the approximated frequencies of all symbols.

In our method, the codeword length of the next symbol to be coded depends only on the approximated frequency of that symbol in the already encoded part of the data stream and on the approximated length of the already encoded part of the data stream. This distinguishes our algorithm from the algorithms described in [1, 2, 3, 4, 5].

A further important feature of our method is that the codeword length of the next symbol to be coded is determined from the approximated frequency and the approximated length of the text. The idea of using approximated rather than exact frequencies was previously used only in the work of Turpin and Moffat [4]. Their method uses a different approximation formula, however: instead of the symbol frequency f, the largest number f' < f is used such that f' = 2^{r/k} for a parameter k and an arbitrary integer r. In our method, the frequency of the symbol and the length of the text are simply divided by the parameter q and rounded (quantized).

3 Canonical codes

Canonical codes are a class of prefix-free codes. A canonical code has the so-called numerical sequence property: codewords of the same length are binary representations of consecutive integers. An example of a canonical code is shown in Fig. 1.

Let l_max be the maximum codeword length, and let n_i, i = 1, ..., l_max, be the number of codewords of length i. By base[i] we denote the first (smallest) codeword of length i. The values base[i] are given by the recurrence base[0] = 0, base[i] = (base[i − 1] + n_{i-1}) × 2. The j-th codeword of length i is the binary representation of the number base[i] + j − 1. Hence, if the length l of the codeword for a symbol a and the index of a among all codewords of length l are known, the codeword for a can be found in constant time. For example, for the code in Fig. 1 we have n₁ = n₂ = 0, n₃ = 2, n₄ = 6 and n₅ = 8. Then base[1] = base[2] = 0, base[3] = 0, base[4] = 4 and base[5] = 20. The codeword for the symbol a₆ is the 4th codeword of length 4, so it is base[4] + (4 − 1) = 4 + 3 = 7, or (0111)₂ in binary. The canonical code is thus uniquely defined by the parameters n_i, i = 1, ..., l_max.
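The recurrence for base[] and the codeword lookup can be checked with a short Python sketch. The function names `build_base` and `codeword` are hypothetical; the counts are those of the example above, with an unused entry n[0] = 0 added so that list positions match codeword lengths.

```python
def build_base(n):
    """n[i] is the number of codewords of length i; n[0] is unused and 0."""
    base = [0] * len(n)
    for i in range(1, len(n)):
        base[i] = (base[i - 1] + n[i - 1]) * 2
    return base

def codeword(base, length, index):
    """Bit string of the index-th (1-based) codeword of the given length."""
    return format(base[length] + index - 1, f"0{length}b")

n = [0, 0, 0, 2, 6, 8]          # n1 = n2 = 0, n3 = 2, n4 = 6, n5 = 8
base = build_base(n)
print(base)                     # [0, 0, 0, 0, 4, 20]
print(codeword(base, 4, 4))     # 0111  (symbol a6: 4th codeword of length 4)
```

Given base[], each lookup is a single addition followed by formatting, which matches the constant-time claim in the text.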

4 Coding with quantization

In our method, the codeword lengths (and the parameters n_i) are determined from the quantized frequencies of the symbols. Let occ_q(a, i) = ⌊occ(a, i)/q⌋ and P_q(i) = ⌈i/q⌉. The parameter q > 1 is called the quantization parameter in what follows. Suppose the data stream s₁s₂...s_{i-1} has already been coded. The codeword length for the next symbol s_i satisfies the following condition: [formula not reproduced in the source]. Claims 5 and 7 describe this length in words as the smallest integer k such that 2^k is at least the ratio of the sum of the quantized text length and n to the quantized frequency.

Consider the case occ(s_i, i − 1) > q. The numerator of the fraction [formula not reproduced in the source] is incremented after q new symbols s_i have been coded. When occ_q(s_i, i − 1) has been incremented, the codeword length for the symbol s_i may change. If it does, the code is modified. The quantization parameter q can be chosen so that the code can be modified in constant amortized time.

The denominator of the fraction [formula not reproduced in the source] is incremented after q arbitrary symbols have been coded. When P_q(i) has been incremented, however, the codeword lengths of several symbols may change. To maintain the invariant [formula not reproduced in the source], all symbols s_i are stored in a list R. After q symbols have been coded, the first q symbols are removed from the list R. For each removed symbol a_r, its codeword length is updated. The q removed symbols are then inserted at the end of the list R and the code C is modified accordingly. The parameter q can be chosen so that the code C can be updated in time linear in q (that is, in constant amortized time).
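The two quantized quantities and the rotation of the list R can be sketched in Python as follows. The helper names `occ_q`, `P_q` and `refresh`, the use of `collections.deque` for R, and the callback `update_length` are assumptions made for illustration only.

```python
from collections import deque

def occ_q(count, q):
    """Quantized frequency: floor(count / q)."""
    return count // q

def P_q(i, q):
    """Quantized text length: ceil(i / q), using integer arithmetic."""
    return -(-i // q)

def refresh(R, count, update_length):
    """Move the first `count` symbols of R to its end, updating each one."""
    for _ in range(min(count, len(R))):
        a = R.popleft()
        update_length(a)        # recompute the codeword length of a
        R.append(a)

q = 4
print([occ_q(c, q) for c in range(1, 10)])   # changes only at counts 4 and 8
print([P_q(i, q) for i in range(1, 10)])     # changes only at i = 5 and i = 9

R = deque("abcde")
refresh(R, 2, lambda a: None)
print(list(R))                               # ['c', 'd', 'e', 'a', 'b']
```

Because both quantities change only once per q steps, the work triggered by a change can be spread over the q symbols coded in between.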

5 Adaptive quantized coding method

This section gives a detailed description of the method used to update the quantized code.

For each symbol a we store the length len(a) of the codeword for a and the index ind(a) of the codeword for a among all codewords of length len(a). All symbols a with occ(a, i) > 0 (all symbols that occur at least once in the already coded part of the text) are stored in the list R. All symbols whose codewords have length l are stored in a list C[l].

If the codeword length of a symbol a is changed from l₁ to l₂, the following operations are performed. Let ind(a) = i, and let a' be the symbol with the last codeword of length l₁. The index of a' is changed from n_{l₁} to i. We remove a' from the end of C[l₁] and replace a by a' in C[l₁]; n_{l₁} is decreased by 1. Thus a' becomes the i-th codeword of length l₁ (i.e. the new codeword for a' is the old codeword for a). The new codeword for a is the last codeword of length l₂: we set ind(a) to n_{l₂} + 1, increment n_{l₂}, and insert a at the end of C[l₂]. All of these operations can be implemented quickly. Once the values of n_{l₁} and n_{l₂} have been updated, the array base[] can be updated.

For example, consider the code in Fig. 1 and suppose that the length of the codeword for the symbol a₆ is changed to 5. We decrement n₄ and replace a₆ by a₈. The index of a₈ is set to 4 (the new codeword for a₈ is (0111)₂). The new codeword for a₆ is the last codeword of length 5: ind[a₆] = 9, len[a₆] = 5 and n₅ = 9. The array base[] changes accordingly: base[5] = 18 = (10010)₂. The new codeword for a₆ is therefore base[5] + ind[a₆] − 1 = (11010)₂.
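The length-change operation can be reproduced with a small Python class. This is a sketch under assumptions: the class name `CanonicalCode` and its methods are hypothetical, the last codeword of length l is taken to have index n_l (which is what the worked example implies), and base[] is recomputed on every lookup for brevity rather than maintained incrementally.

```python
class CanonicalCode:
    def __init__(self, max_len):
        self.n = [0] * (max_len + 1)               # n[l]: codewords of length l
        self.C = [[] for _ in range(max_len + 1)]  # C[l]: symbols of length l, in index order
        self.length = {}                           # len(a)
        self.ind = {}                              # ind(a), 1-based within C[len(a)]

    def insert(self, a, l):
        """Add a as the last codeword of length l."""
        self.C[l].append(a)
        self.n[l] += 1
        self.length[a], self.ind[a] = l, self.n[l]

    def change_length(self, a, l2):
        """Move a from its current length l1 to length l2."""
        l1, i = self.length[a], self.ind[a]
        last = self.C[l1].pop()                    # a': last codeword of length l1
        if last != a:
            self.C[l1][i - 1] = last               # a' takes over the slot of a
            self.ind[last] = i
        self.n[l1] -= 1
        self.insert(a, l2)                         # a becomes last codeword of length l2

    def codeword(self, a):
        base = [0] * len(self.n)
        for l in range(1, len(self.n)):            # base[l] = (base[l-1] + n[l-1]) * 2
            base[l] = (base[l - 1] + self.n[l - 1]) * 2
        l = self.length[a]
        return format(base[l] + self.ind[a] - 1, f"0{l}b")

# Code of the worked example: a1, a2 have length 3, a3..a8 length 4, a9..a16 length 5.
code = CanonicalCode(5)
for k in range(1, 17):
    code.insert(f"a{k}", 3 if k <= 2 else 4 if k <= 8 else 5)
code.change_length("a6", 5)
print(code.codeword("a8"))   # 0111
print(code.codeword("a6"))   # 11010
```

Only two symbols change their stored index, so the bookkeeping per length change is constant; the codewords of the other length-5 symbols shift implicitly because base[5] changes from 20 to 18.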

The algorithm encodes the data stream s₁s₂...s_m. After the symbol s_i has been coded, the code is modified as follows:

1. If the symbol s_i occurs for the first time, the codeword length of s_i is set to ⌈log(i + n + 1)⌉ and s_i is inserted at the end of the list R.
2. If occ(s_i, i − 1) > 0 and occ(s_i, i) ≡ 0 (mod q) (i.e. s_i occurs more than once in s₁...s_i and the frequency of s_i is divisible by q), the symbol s_i is removed from the list R and the codeword length of s_i is set to ⌈log((P_q(i) + n)/occ_q(s_i, i))⌉ (cf. claim 7). The code is updated as described above.
3. Let p be a parameter. If i > 0 and i ≡ 0 (mod q), the first p elements a_{r_1}, ..., a_{r_p} are removed from the list R and their codeword lengths are updated (cf. claim 5). If occ(a_{r_j}, i) > q, the codeword length of a_{r_j} is set to ⌈log((P_q(i) + n)/occ_q(a_{r_j}, i))⌉. If occ(a_{r_j}, i) ≤ q, the codeword length of a_{r_j} is set to ⌈log(i + n + 1)⌉. Finally, the removed symbols are inserted at the end of the list R.
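The length rules used in these steps can be sketched in Python. The sketch follows the wording of claims 5 and 7 (smallest integer k such that 2^k is at least i + n + 1, or at least the ratio of the quantized text length plus n to the quantized frequency); the function names are hypothetical, and the logarithm is taken to base 2 because the claims speak of powers of two.

```python
def ceil_log2(x):
    """Smallest integer k with 2**k >= x, for an integer x >= 1."""
    return (x - 1).bit_length()

def new_symbol_length(i, n):
    """Length for a new or rare symbol: ceil(log2(i + n + 1))."""
    return ceil_log2(i + n + 1)

def quantized_length(i, n, occ, q):
    """Smallest k with 2**k >= (P_q(i) + n) / occ_q, in integer arithmetic."""
    text_q = -(-i // q)                  # quantized text length, ceil(i / q)
    freq_q = occ // q                    # quantized frequency, floor(occ / q)
    if freq_q == 0:
        raise ValueError("quantized frequency must be at least 1")
    k = 0
    while (1 << k) * freq_q < text_q + n:
        k += 1
    return k

def refreshed_length(i, n, occ, q):
    """Rule applied to a symbol taken from the front of R (claim 5)."""
    if occ > q:
        return quantized_length(i, n, occ, q)
    return new_symbol_length(i, n)

# After i = 100 symbols over an alphabet of n = 16 with q = 4:
print(new_symbol_length(100, 16))        # 7, since 2**7 = 128 >= 117
print(refreshed_length(100, 16, 20, 4))  # 4, since 2**4 * 5 = 80 >= 25 + 16
```

Using integer comparisons instead of floating-point logarithms avoids rounding problems at exact powers of two.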

It can be shown that the number of bits needed to encode a data stream S with this method is at most (H + 1)m + O(n log² m) when the parameter q = log m. In this formula, H = Σ_a (occ(a)/m) · log(m/occ(a)), summed over all symbols a that occur in S, is the empirical entropy.
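The leading term of this bound can be evaluated for a concrete text with a few lines of Python. The sketch assumes the standard zeroth-order definition H = Σ_a (occ(a)/m) · log₂(m/occ(a)); the function names are hypothetical, and the O(n log² m) term is left out because the text gives no constant for it.

```python
from collections import Counter
from math import log2

def empirical_entropy(text):
    """Zeroth-order empirical entropy H of the text, in bits per symbol."""
    m = len(text)
    return sum(c / m * log2(m / c) for c in Counter(text).values())

def leading_term(text):
    """(H + 1) * m, the leading term of the bound on the coded length."""
    return (empirical_entropy(text) + 1) * len(text)

print(empirical_entropy("aabb"))   # 1.0
print(leading_term("aabb"))        # 8.0
```

For long texts over a small alphabet the omitted term grows only polylogarithmically in m, so (H + 1)m dominates the bound.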

The described algorithm can also be used efficiently in practice.

Literature

[1] N. Faller, "An Adaptive System for Data Compression", in Record of the 7th Asilomar Conference on Circuits, Systems, and Computers (1973), 593-597.
[2] R. G. Gallager, "Variations on a Theme by Huffman", IEEE Trans. on Information Theory 24 (1978), 668-674.
[3] D. E. Knuth, "Dynamic Huffman Coding", J. Algorithms 6 (1985), 163-180.
[4] A. Turpin, A. Moffat, "On-line Adaptive Canonical Prefix Coding with Bounded Compression Loss", IEEE Trans. on Information Theory 47 (2001), 88-98.
[5] J. S. Vitter, "Design and Analysis of Dynamic Huffman Codes", J. ACM 34 (1987), 825-845.

Claims

1. A method for the adaptive prefix-free coding of a data stream over an alphabet of n elements with two parameters p and q, where p is greater than q, characterized in that the codeword lengths are determined from the length of the already coded part of the text divided by the parameter q and rounded up, hereinafter called the quantized text length, and from the frequency of the symbol in the already coded part of the text divided by the parameter q and rounded down, hereinafter called the quantized frequency, the method comprising:
- a list R of already coded symbols
- a method A for updating the length of the coded symbol
- a method B which modifies the code after the length of a symbol has been changed.

2. The method according to claim 1, characterized in that for every element of the input alphabet that has been coded at least once, its codeword length and its index among all codewords of the same length are stored, where the indices of the symbols having the same codeword length form a sequence of consecutive integers starting at one.

3. The method according to claim 1, characterized in that every new codeword of length l inserted into the code becomes the last codeword of length l, i.e. the new codeword has the largest index among all codewords of length l.

4. The method according to claim 1, characterized in that the first codeword of length 1 has the value 0 and the first codeword of length l is determined as twice the sum of the first codeword of length l − 1 and the number of codewords of length l − 1.

5. The method according to claim 1, characterized in that the list R of already coded elements of the input alphabet is updated regularly as follows: after a sequence of q input symbols has been coded, the first p elements are removed from the list R and the codeword lengths of the removed symbols are updated; the codeword length of each removed symbol a is changed as follows:
- if the frequency of a is less than or equal to q and i input symbols have already been coded, the codeword length of a is set to the smallest integer k such that the k-th power of two is at least as large as the sum of i and n incremented by one
- if the frequency of a is greater than q, the codeword length of a is set to the smallest integer k such that the k-th power of two is at least as large as the ratio of the sum of the quantized text length and n to the quantized frequency.
Finally, the symbols are inserted again at the end of the list R.

6. Method B according to claim 1, characterized in that when the length of the codeword for a symbol a is changed from l₁ to l₂, the following operations are performed:
- the symbol with the largest codeword of length l₁ is assigned the index of the old codeword for a
- the new codeword of the symbol a becomes the last codeword of length l₂
- the numbers of codewords of lengths l₁ and l₂ are changed accordingly.

7. Method A according to claim 1, characterized in that after the coding of the i-th symbol in the data stream, hereinafter called s, the following operations are performed: if s occurs for the first time in the data stream, the codeword length for s is set to the smallest integer k such that the k-th power of two is at least as large as the sum of i and n incremented by one; if the frequency of s is a multiple of q, the codeword length of s is set to the smallest integer k such that the k-th power of two is at least as large as the ratio of the sum of the quantized text length and n to the quantized frequency.

8. The method according to claim 1, characterized in that the code is empty and contains no codewords before the coding begins.

9. The method according to claim 1, characterized in that a new codeword of length l that is added to the code becomes the last codeword of length l.

10. The method according to claim 1, wherein the very first symbol in the data stream is coded as a binary representation of the index of this symbol in the input alphabet.

11. The method according to claim 10, wherein, after the very first symbol s has been coded, the codewords for the symbol (NYT) and for the symbol s are inserted into the code.

12. The method according to claim 1, wherein a symbol a that is coded for the first time, but is not the very first symbol in the data stream, is coded as the special symbol (NYT) followed by the index of a in the input alphabet.

13. The method according to claim 1, wherein a next symbol a to be coded that has occurred at least once in the already coded part of the data stream is coded with the codeword corresponding to the symbol a.

14. The method according to claim 13, wherein the codeword for a symbol with length l and index i is determined as the binary representation of the sum of the first codeword of length l and the index i, decremented by 1.