FI110374B

FI110374B - Method of compressing information

Info

Publication number: FI110374B
Application number: FI20010525A
Authority: FI
Inventors: Petri Kuukkanen; Jukka Saarinen
Original assignee: Coression Oy
Priority date: 2001-03-16
Filing date: 2001-03-16
Publication date: 2002-12-31
Also published as: FI20010525A; FI20010525A0; WO2002075929A1

Description


Method of compressing information

Field of the Invention

The present invention relates to a method for compressing data according to the preamble of claim 1. The present invention also relates to an apparatus according to the preamble of claim 9.

Today the most efficient data compression systems, such as Ziv-Lempel-type algorithms, PPM [4] and Block-Sorting [5], are adaptive and require long data strings to be effective. However, data compression runs into considerable problems when the files to be compressed are short. Examples are e-mail messages and letters, where the file length ranges from tens to thousands of characters. Because such files must be encoded and decoded separately (see Fig. 1), they generally cannot be concatenated into long strings, which would allow compression with adaptive algorithms.


In fact, the first data compression systems, before the arrival of adaptive algorithms, were non-adaptive: the data were modeled with a fixed Markov-type finite state machine. This machine was constructed by collecting statistics on the occurrences of symbols in each state from training data that were thought to represent the files to be compressed. An example of such a data compression system for black-and-white images is described in Langdon, G.G., Jr. and Rissanen, J.J., "Compression of Black-White Images with Arithmetic Coding", IEEE Trans. Communication, Vol. Com-29, No. 6, pp. 858-867, June 1981.

Such systems have problems, however, especially with a large alphabet such as that of natural languages, and in choosing the structure and size of the finite state machine to be used. The achieved compression improves up to a certain limit as the number of states given to the machine grows. This, however, creates the problem that the number of parameters grows exponentially with the order of the Markov machine that is to be fitted to the data. For example, a first-order Markov machine for a 256-symbol alphabet has the same number of states, 256, and $256^2$ parameters, whereas a second-order machine performs better but requires $256^3$ parameters, and so on. Clearly, many of the states never occur in any training data, which raises the problem of selecting the most important of the states that do occur; to keep the machine at a manageable size, one should select some desired number of states that give the best compression result.
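The growth described above is easy to tabulate. The following is a minimal sketch, assuming one conditional count per (state, next symbol) pair, so that an order-k machine over d symbols has d^k states and d^(k+1) parameters; the function name is illustrative and does not come from the patent.

```python
def markov_parameter_count(alphabet_size: int, order: int) -> int:
    """Number of conditional counts in a full Markov machine of the given order."""
    states = alphabet_size ** order          # one state per possible context
    return states * alphabet_size            # one parameter per (state, symbol) pair

for order in (1, 2, 3):
    print(order, markov_parameter_count(256, order))
```

Already at order 3 the table would need more than four billion entries, which is why only a selected subset of states can be kept.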

Summary of the Invention

It is an object of the present invention to provide an improved method and apparatus for lossless compression of short data. The invention is based on the idea of using an algorithm to examine the nodes of a Markov machine in order to determine which nodes' leaves can be left out. This algorithm then produces a simpler machine for compressing the data. In this specification this compression machine is also called a tree machine. More precisely, the method according to the invention is characterized by what is stated in the characterizing part of claim 1. The apparatus according to the invention is characterized by what is stated in the characterizing part of claim 9.

When not all the states of a Markov machine of some order are included, one in fact ends up fitting Tree Machines (TM) to the data [3]. The present invention presents a way to construct a data compression system intended for compressing short files with a fixed tree machine of a desired size. This machine is a subtree machine of the maximal tree machine that is optimal among all subtree machines of the same size. The maximal tree machine can be determined in a simple manner from a large collection of short files, called the training set, which represent the files to be compressed, as described later in this specification. Once this is done, a subtree machine is constructed that has the desired number of nodes and that is optimal for the training data in the sense that it allows them to be compressed with the shortest ideal code length. Such a machine therefore captures the statistical properties of the files in the collection better than any other tree machine of the same size. In the second stage, an arithmetic code is created for encoding each short file symbol by symbol, using pre-constructed (non-adaptive) tables of addends (code words) at the so-called coding nodes. A coding node is the deepest node of the tree in which a symbol occurring in the training data has a positive count. Symbols that do not occur are encoded with a suitably chosen escape mechanism, which is described later in this specification.

The invention offers several advantages over the prior art. In particular, short texts are compressed much more efficiently than with prior art methods. Better compression reduces the memory needed to store such compressed data, the amount of data to be transferred, and the time required to transfer it.

The invention is described in more detail below with reference to the accompanying drawings, in which

Fig. 1a schematically shows the steps of compressing a short data file in a method according to a preferred embodiment of the present invention,
Fig. 1b shows, as a simplified block diagram, an electronic device in which a method according to a preferred embodiment of the present invention can be applied,
Fig. 2 shows the construction of a maximal tree machine from training data and the construction of a subtree machine by pruning the maximal tree with an algorithm according to a preferred embodiment of the present invention,
Fig. 3 shows the compression of a data block by means of an arithmetic coder or a Huffman code and the probabilities in the pruned tree machine,
Fig. 4 shows the decompression of a compressed data block,
Fig. 5 shows simulation results of compression performed with a method according to a preferred embodiment of the invention and with some prior art methods,
Fig. 6 shows an example of a tree machine constructed from example training data,
Fig. 7 shows the testing of some nodes of the context tree of Fig. 6 for pruning,
Fig. 8 shows an example of pruning a node,
Fig. 9 shows an example of checking a node for pruning,
Fig. 10 shows an example of a context tree,
Fig. 11 shows a recursive implementation of a preferred algorithm of the method according to the invention and an example of a context tree, and
Fig. 12 shows, as a block diagram, the implementation of another preferred algorithm of the method according to the invention.


Detailed Description

Maximal tree machine

The following describes a tree machine for an alphabet A of size d, where according to a preferred embodiment d = 256. A tree machine is a tree in which at most d branches, i.e. child nodes, emanate from each node s, and in which the nodes hold counts $n_{i|s}$ indicating the number of times symbol i of the alphabet "occurs" in "context" s. Roughly speaking, "context" s means a state in a Markov-type finite state machine. Next it is described how training data, consisting of a large number of files of a general nature written one after another to form a long string $x^n = x_1, \ldots, x_n$ of symbols $x_t$ of the alphabet A, define a tree machine of maximum depth K.


Each string $s = i, j, \ldots, m$ of length k, $k \le K$, that occurs in $x^n$ as a substring $s = x_{t-k+1}, \ldots, x_t$ for some t, defines a path from the root to node s, namely $s = m, \ldots, j, i = x_t, x_{t-1}, \ldots, x_{t-k+1}$, which is determined by reading the symbols in reverse order. Suppose that the string s occurs $n_s$ times. At these occurrences the string s is immediately followed by symbols of A, unless the string is the last string of the training data. Let $A_s$ be the set of distinct symbols that immediately follow the various occurrences of s, and let symbol i occur $n_{i|s}$ times, so that $\sum_{i \in A_s} n_{i|s} = n_s$; here it is assumed that s does not consist of the last symbols of $x^n$, so that some symbol can follow it. In the tree machine the counts $\{n_{i|s}\}$ are stored in node s. An important point to note is that if the string $x_{t-k}, \ldots, x_t$ occurs in $x^n$, then so does the shorter string $s = x_{t-k+1}, \ldots, x_t$; in other words, any symbol $x_{t+1}$ that occurs in a child node $x_t, \ldots, x_{t-k+1}, x_{t-k}$, i.e. immediately follows $x_t$, certainly also occurs in the shorter parent node $s = x_t, \ldots, x_{t-k+1}$. It follows that if $s' = s\,x_{t-k} = x_t, x_{t-1}, \ldots, x_{t-k+1}, x_{t-k}$ denotes a child node of s, then $n_{i|s'} \le n_{i|s}$ for all $i \in A_{s'}$, while the union of the sets $A_{s'}$ over the child nodes is a subset of $A_s$. In other words, the tree is incomplete: some child nodes are missing from the tree, and not all symbols of the alphabet need to occur in every node of the tree.
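The counting procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patent's implementation: the tree is stored as a dictionary keyed by the context in reading order (the path from the root is the same string read backwards), and the names `build_counts` and `max_depth` are hypothetical. The example string is the training data used for Fig. 6.

```python
from collections import Counter, defaultdict

def build_counts(text: str, max_depth: int) -> dict:
    """Map every context of length 0..max_depth to a Counter of the symbols that follow it."""
    counts = defaultdict(Counter)
    for t, symbol in enumerate(text):
        for k in range(min(t, max_depth) + 1):
            context = text[t - k:t]          # the k symbols preceding position t
            counts[context][symbol] += 1     # n_{i|s} for s = context, i = symbol
    return dict(counts)

counts = build_counts("VESIHIISI SIHISI HISSISSÄ", max_depth=3)
print(counts["HI"])     # symbols seen after the context "HI"
print(counts["IHI"])    # a child node of "HI"
```

The empty string plays the role of the root node. Because every occurrence of a longer context is also an occurrence of its shortened version, no count in a child node exceeds the corresponding count in its parent.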

Fig. 6 shows a tree machine based on the training data "VESIHIISI SIHISI HISSISSÄ". Such a tree machine, and in fact any subtree machine of it whose root is at the root of the maximal tree machine, can be used to encode any string of symbols of the given alphabet. The coding process is described below, but first we describe the so-called ideal code length, or empirical entropy, that each tree machine assigns to the training string.


Ideal code length of the training data

Consider any subtree W of the maximal tree, defined by the nodes $\lambda, s_1, \ldots, s_w$, where $\lambda$ denotes the root node. After the whole of $x^n$ has been processed, each node s holds the counts $\{n_{i|s}\}$. The ideal code length of a symbol i occurring in node s is defined as $\log(n_s / n_{i|s})$. As noted above, the same symbol also occurs in the parent node. Let $s(t)$ denote the deepest node of the tree, called a leaf node, in which the symbol $x_t$ of the training data occurs. This node is determined by climbing the tree while reading the symbols of the past string $x^{t-1}$ from right to left. The sum of the ideal code lengths of all the symbols $x_t$ of the training data is the ideal code length that the tree machine W assigns to the training string $x^n$, i.e.

$$L_W(x^n) = \sum_t \log \frac{n_{s(t)}}{n_{x_t|s(t)}} \qquad (1)$$

If the first few symbols, at most K of them, are ignored, a simpler formula is obtained for the ideal code length:

$$L_W(x^n) = \sum_{s \in E} \Bigl[ n_s \log n_s - \sum_{i \in A_s} n_{i|s} \log n_{i|s} \Bigr] \qquad (2)$$

where E denotes the set of the deepest nodes of the tree in which symbols occur. These nodes clearly define the size and shape of the maximal tree.
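To make the relation between Equations (1) and (2) concrete, the sketch below evaluates both for a full tree of fixed depth, where every symbol after the first `depth` symbols is coded in the node given by its `depth` preceding symbols. This fixed-depth simplification, the base-2 logarithm, and the function name are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

def ideal_code_lengths(text, depth):
    """Return (Eq. 1 value, Eq. 2 value) in bits, skipping the first `depth` symbols."""
    counts = defaultdict(Counter)
    for t in range(depth, len(text)):
        counts[text[t - depth:t]][text[t]] += 1

    per_symbol = 0.0                      # Eq. (1): one term per coded symbol
    for t in range(depth, len(text)):
        node = counts[text[t - depth:t]]
        per_symbol += math.log2(sum(node.values()) / node[text[t]])

    per_node = 0.0                        # Eq. (2): one term per node in E
    for node in counts.values():
        n_s = sum(node.values())
        per_node += n_s * math.log2(n_s) - sum(n * math.log2(n) for n in node.values())
    return per_symbol, per_node

print(ideal_code_lengths("VESIHIISI SIHISI HISSISSÄ", depth=2))
```

Both sums group the same terms differently, so the two returned values agree up to floating-point rounding.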

Coding

The tree machine is used for encoding strings in much the same way as a finite state machine. The string to be compressed, say $y^n = y_1, y_2, \ldots, y_n$, is preferably read from left to right, and for each symbol $y_{t+1}$ the tree is climbed by reading the preceding symbols in reverse order, $y_t, y_{t-1}, \ldots = s^*$, until a node $s^*$ is found as the context of the symbol $y_{t+1}$, after which the symbol is encoded with the conditional probability $P(y_{t+1}|s^*)$ determined by the counts of this node. If all d symbols of the alphabet occurred in all the nodes, the nodes $s^*$ could be taken to be the deepest nodes of the tree along the path $y_t, y_{t-1}, \ldots$. In large alphabets, however, not all symbols occur in all nodes, which raises the problem of how to encode symbols whose counts $n_{i|s}$ are zero.
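Climbing the tree along the reversed history can be sketched as a longest-suffix lookup. The sketch assumes the tree is given as a set of context strings in reading order that contains the root `""` and the parent of every node; `deepest_context` is a hypothetical helper name, and handling of symbols with zero count is left out here.

```python
def deepest_context(nodes: set, history: str) -> str:
    """Longest suffix of `history` that is a node of the tree ('' is the root)."""
    context = ""
    for k in range(1, len(history) + 1):
        candidate = history[-k:]          # y_t, y_{t-1}, ... taken one step deeper each time
        if candidate not in nodes:
            break                         # no such branch: stop climbing
        context = candidate
    return context

nodes = {"", "I", "S", "HI", " HI"}
print(deepest_context(nodes, "SI HI"))    # climbs I -> HI -> " HI"
```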


The problem can be solved in several ways. Consider a context that has occurred 20 times. If "A" is the only symbol that has followed the context at all of these occurrences, the probability that the next symbol would be a different one is very small. If, however, e.g. 10 different symbols had followed, it would appear that the text being compressed is highly variable and that some new symbol is likely to appear soon. If q distinct symbols have occurred in a context, the event "the next symbol has not previously occurred in this context" is given the probability $q/(n_s + q)$. In this way the conditional probability of any particular symbol that has not previously occurred is

$$P_{i|s} = \frac{q}{n_s + q} \cdot \frac{1}{d - q} \qquad (3)$$

where d − q is the number of all the symbols that have not occurred in this context. The conditional probabilities of the symbols that have occurred are

$$P_{i|s} = \frac{n_{i|s}}{n_s + q} \qquad (4)$$

With these probabilities an arithmetic code or a Huffman code can be constructed at each deepest coding node.
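Equations (3) and (4) together form a complete probability distribution over the alphabet, which the following minimal sketch checks numerically. The function name is hypothetical, and the node is assumed to be non-empty and to contain fewer than d distinct symbols.

```python
from collections import Counter

def symbol_probability(node: Counter, symbol: str, d: int = 256) -> float:
    """Conditional probability of `symbol` in a node holding the counts n_{i|s}."""
    n_s = sum(node.values())                     # total count in the node
    q = len(node)                                # number of distinct symbols seen
    if node.get(symbol, 0) > 0:
        return node[symbol] / (n_s + q)          # Eq. (4): symbol seen in this context
    return q / (n_s + q) / (d - q)               # Eq. (3): escape mass shared by unseen symbols

node = Counter({"I": 1, "S": 2})
total = sum(symbol_probability(node, chr(c)) for c in range(256))
print(symbol_probability(node, "S"), total)
```

For the context that occurred 20 times and was always followed by "A", the total escape mass is 1/21, whereas with 10 distinct followers it rises to 10/30.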


Algorithms for an optimal subtree of a given size

Next we look, within the maximal tree, for the subtree with M nodes that has the smallest ideal code length for $x^n$. Such an optimal subtree is the maximum likelihood tree with M nodes, because the ideal code length is the negative logarithm of the maximum likelihood of $x^n$ in the class of subtrees of the maximal tree in which the symbols occurring in each node s have the conditional probabilities $P_{i|s}$. Before describing the algorithm for pruning the maximal tree to an optimal subtree of the desired size, a simpler algorithm is described, which is optimal under the restriction that either all or none of the branches of a node are pruned. This clearly reduces the set E at each parent node examined, not necessarily as a whole but by the number of its child nodes. This algorithm is suitable for small alphabets.


Algorithm A

1. Initialization: For each parent node s of the tree, compute the code length $L(s) = \sum_i n_{i|s}(-\log_2 P_{i|s})$, using the counts of s and Equation (4) for $P_{i|s}$.

2. Working backwards from the leaves, compute at each parent node s: $l(s) = \sum_j L(sj)$. If $l(s) \ge L(s) + \varepsilon$, prune all the child nodes sj and their descendants; otherwise leave all the nodes sj untouched. Set $L(s) = \min\{l(s), L(s)\}$.

3. Continue until the root is reached.

Here is another pruning algorithm, which examines one branch at a time rather than all of them at once:

Algorithm B

At the start of Algorithm B the value of ε is set (block 1 in the block diagram of Fig. 11). Pruning then starts from the root node.

1. Initialization: For each leaf node s of the tree, compute the code length $L(s) = \sum_i n_{i|s}(-\log_2 P_{i|s})$, using the counts of s and Equations (3) and (4) for $P_{i|s}$. For each internal node s, set $L(s) = 0$. This initialization step is depicted mainly in blocks 2-6 of Fig. 11.

2. Working backwards from the leaves, compute at each node sj: $l(sj) = \sum_i n_{i|sj}(-\log_2 P_{i|s})$, i.e. using the counts of the child node sj and the probabilities of the parent node s. Increase the code length $L(s)$ of the parent node by $\min\{l(sj), L(sj)\}$. If $l(sj) \le L(sj) + \varepsilon$, prune the branch sj and all its descendants; otherwise leave sj untouched. This step is depicted mainly in blocks 7-10 of Fig. 11.

3. Continue until the root is reached.
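The steps of Algorithm B can be sketched as follows. This is an illustrative reading of the steps above, not the patent's reference implementation: the tree is a dictionary from contexts (in reading order) to symbol counts, nodes are visited deepest first so that $L(sj)$ is final before sj is compared with its parent, and only Equation (4) is needed because every symbol of a child also occurs in its parent. The names `build_counts`, `code_length` and `prune_b` are hypothetical. The example uses the training string of Fig. 6 with an assumed maximum depth of 3.

```python
import math
from collections import Counter, defaultdict

def build_counts(text, max_depth):
    """Context (in reading order) -> Counter of the symbols that follow it."""
    counts = defaultdict(Counter)
    for t, symbol in enumerate(text):
        for k in range(min(t, max_depth) + 1):
            counts[text[t - k:t]][symbol] += 1
    return dict(counts)

def code_length(own, model):
    """Bits for the counts in `own`, with Eq. (4) probabilities taken from `model`."""
    n, q = sum(model.values()), len(model)
    return sum(c * -math.log2(model[i] / (n + q)) for i, c in own.items())

def prune_b(counts, eps=0.0):
    """Return (surviving contexts, L, l) after one pass of Algorithm B."""
    kept = set(counts)
    internal = {c[1:] for c in counts if c}          # nodes that have at least one child
    L = {s: 0.0 if s in internal else code_length(counts[s], counts[s]) for s in counts}
    l = {}
    for sj in sorted((c for c in counts if c), key=len, reverse=True):
        s = sj[1:]                                   # parent: drop the oldest symbol
        l[sj] = code_length(counts[sj], counts[s])   # child's counts, parent's probabilities
        L[s] += min(l[sj], L[sj])
        if l[sj] <= L[sj] + eps:
            kept -= {c for c in kept if c.endswith(sj)}   # remove sj and its descendants
    return kept, L, l

kept, L, l = prune_b(build_counts("VESIHIISI SIHISI HISSISSÄ", max_depth=3))
print(round(L["HI"], 2), round(l["HI"], 2), "IHI" in kept, " HI" in kept)
```

With ε = 0 the node "IHI" is removed while " HI" (written "_HI" in the figures) and "HI" survive; a very large ε leaves only the root.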

The tree obtained in this way depends on the value of ε. When ε = 0, the pruned tree compresses at least as well as the unpruned tree but is generally smaller. When ε > 0, the pruned tree is smaller than is needed for maximum compression, but it is nevertheless the best of all trees of the same size. If ε is large enough, the size of the pruned tree is 1, comprising only the root node.

An easy way to determine the right value of ε is to construct a subtree with M nodes by a bisection search:

Algorithm C

1. Set $\varepsilon_{high}$ to $\max\{L(sj) - l(sj)\}$ over all internal nodes s of the tree. Set $\varepsilon_{low} = 0$ (block 11 in the block diagram of Fig. 12).

2. Prune the tree with the value $(\varepsilon_{high} + \varepsilon_{low})/2$ and count the nodes of the resulting tree (block 12). If the count did not change (block 13) or the count equals the desired number of nodes (block 14), exit. If the count is smaller than the desired count (block 15), set $\varepsilon_{high} = (\varepsilon_{high} + \varepsilon_{low})/2$ (block 16). If the count is larger than the desired count, set $\varepsilon_{low} = (\varepsilon_{high} + \varepsilon_{low})/2$ (block 17). Restore the original tree and return to step 2 (block 12).

3. Continue until the last count is the same as the previous one.
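The bisection search can be sketched independently of the pruning routine. In this minimal sketch `count_nodes` is a hypothetical callable that prunes a fresh copy of the original tree with a given ε and returns the number of surviving nodes; the initial `eps_high` is supplied by the caller, and `max_iter` is an added safety limit that is not part of the steps above.

```python
def find_epsilon(count_nodes, target, eps_high, eps_low=0.0, max_iter=100):
    """Bisect on epsilon until the pruned tree has `target` nodes or the count stalls."""
    previous = None
    eps, count = eps_high, None
    for _ in range(max_iter):
        eps = (eps_high + eps_low) / 2
        count = count_nodes(eps)          # always prunes the original, unpruned tree
        if count == target or count == previous:
            break
        if count < target:
            eps_high = eps                # pruned too much: lower the threshold
        else:
            eps_low = eps                 # pruned too little: raise the threshold
        previous = count
    return eps, count

# Toy stand-in for a pruner: a node survives when its gain exceeds epsilon; the root always stays.
gains = [0.5, 1.5, 2.5, 3.5]
toy_count = lambda eps: 1 + sum(g > eps for g in gains)
print(find_epsilon(toy_count, target=2, eps_high=4.0))
```

Because the node count is a step function of ε, the search may stop on an unchanged count before reaching the target exactly, which mirrors the exit condition in step 2.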


Next, the context tree shown in Fig. 6 is considered as an example of pruning method B. Starting from the leaves, consider the two branches of node "HI", i.e. the nodes "_HI" and "IHI" (see Fig. 7). The code lengths L of these leaf nodes are set in the initialization step and computed by means of Equation (4):

$$L(\text{"\_HI"}) = \sum_i n_{i|\text{"\_HI"}} \bigl(-\log_2 P_{i|\text{"\_HI"}}\bigr) = \sum_i n_{i|\text{"\_HI"}} \Bigl(-\log_2 \frac{n_{i|\text{"\_HI"}}}{n_{\text{"\_HI"}} + q_{\text{"\_HI"}}}\Bigr) = 1 \cdot \Bigl(-\log_2 \frac{1}{1+1}\Bigr) = 1$$

$$L(\text{"IHI"}) = \sum_i n_{i|\text{"IHI"}} \bigl(-\log_2 P_{i|\text{"IHI"}}\bigr) = \sum_i n_{i|\text{"IHI"}} \Bigl(-\log_2 \frac{n_{i|\text{"IHI"}}}{n_{\text{"IHI"}} + q_{\text{"IHI"}}}\Bigr) = 1 \cdot \Bigl(-\log_2 \frac{1}{2+2}\Bigr) + 1 \cdot \Bigl(-\log_2 \frac{1}{2+2}\Bigr) = 4$$

Here, too, q_s denotes the total number of distinct characters occurring in node s, and n_{i|s} is the count of character i in node s. The sum of the counts in node s is denoted n_s.

In the pruning phase, the new code lengths l are calculated with the same formula, using the counts of the branch and the probabilities of the parent node "HI":

$$l(\text{"\_HI"}) = \sum_i n_{i|\text{"\_HI"}} \left(-\log_2 p_{i|\text{"HI"}}\right) = \sum_i n_{i|\text{"\_HI"}} \left(-\log_2 \frac{n_{i|\text{"HI"}}}{n_{\text{"HI"}} + q_{\text{"HI"}}}\right) = 1 \cdot \left(-\log_2 \frac{2}{3+2}\right) \approx 1.32$$

$$l(\text{"IHI"}) = \sum_i n_{i|\text{"IHI"}} \left(-\log_2 p_{i|\text{"HI"}}\right) = \sum_i n_{i|\text{"IHI"}} \left(-\log_2 \frac{n_{i|\text{"HI"}}}{n_{\text{"HI"}} + q_{\text{"HI"}}}\right) = 1 \cdot \left(-\log_2 \frac{1}{3+2}\right) + 1 \cdot \left(-\log_2 \frac{2}{3+2}\right) \approx 3.64$$

L("HI") is the sum of min{L("_HI"), l("_HI")} and min{L("IHI"), l("IHI")}, which equals L("_HI") + l("IHI") ≈ 4.64. Pruning takes place whenever l(s) ≤ L(s) + ε. Here it is assumed that ε = 0.

Consequently, node "_HI" remains intact and node "IHI" is pruned away (see Figure 8).

In the next step, the value L("HI") just calculated is compared with the value l("HI"):

$$l(\text{"HI"}) = \sum_i n_{i|\text{"HI"}} \left(-\log_2 p_{i|\text{"I"}}\right) = 2 \cdot \left(-\log_2 \frac{2}{13}\right) + 1 \cdot \left(-\log_2 \frac{4}{13}\right) \approx 7.10$$

l("HI") is much larger than L("HI") + ε, so node "HI" is not pruned (see Figure 9). If L("HI") + ε were greater than or equal to l("HI"), node "HI" and all its children would be pruned.

The pruning procedure continues in this way, working back from the leaves, until the root is reached. The root cannot be pruned, because it has no parent node and there is nothing to compare it with.
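The worked example above can be reproduced with a short sketch. This is an illustration under stated assumptions, not the patent's implementation: the `Node` class and the function names are hypothetical, and the character labels "_" and "S" are chosen only so that the counts match the figures in the text (the text gives the counts, not the characters). The pruning test uses l(s) ≤ L(s) + ε, as in the claims.

```python
from math import log2


class Node:
    """One context-tree node: its context string, character counts, children."""

    def __init__(self, context, counts, children=()):
        self.context = context
        self.counts = dict(counts)      # character -> count n_{i|s}
        self.children = list(children)
        self.L = None                   # first code length, filled in by prune()


def code_length(counts, model):
    """Sum of n_i * (-log2 p_i), with p_i = n_i / (n + q) taken from `model`."""
    n = sum(model.values())             # n_s: total count in the model node
    q = len(model)                      # q_s: number of distinct characters
    return sum(c * -log2(model[ch] / (n + q)) for ch, c in counts.items())


def prune(node, eps=0.0):
    """Prune the subtree below `node` bottom-up and return L(node)."""
    if not node.children:
        node.L = code_length(node.counts, node.counts)
        return node.L
    total, kept = 0.0, []
    for child in node.children:
        L_child = prune(child, eps)                       # children first
        l_child = code_length(child.counts, node.counts)  # parent's model
        total += min(L_child, l_child)
        if l_child > L_child + eps:                       # otherwise drop it
            kept.append(child)
    node.children = kept
    node.L = total
    return total


# Counts follow the worked example; the character labels are assumptions.
leaf_a = Node("_HI", {"_": 1})
leaf_b = Node("IHI", {"_": 1, "S": 1})
hi = Node("HI", {"_": 2, "S": 1}, [leaf_a, leaf_b])
top = Node("I", {"_": 2, "H": 2, "I": 1, "S": 4}, [hi])

prune(top, eps=0.0)
print([c.context for c in hi.children])              # ['_HI']
print(round(hi.L, 2))                                # 4.64
print(round(code_length(hi.counts, top.counts), 2))  # 7.1
```

With ε = 0 the sketch removes "IHI", keeps "_HI" and "HI", and yields L("HI") ≈ 4.64 against l("HI") ≈ 7.10. A larger ε would prune more aggressively.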

In the Patricia tree structure used in this example, there may be nodes that represent several contexts at one point, such as the node for the context "SSÄ" in the upper right corner of the tree. The node "SSÄ" and its counts could be written (and, for correct pruning, should be written) as three separate nodes, "Ä", "SÄ" and "SSÄ", all with the same occurring characters and counts. However, since these three nodes have the same counts, l("SSÄ") = L("SÄ") = l("SÄ"). Since this node is a leaf, the value L("SSÄ") cannot be obtained from the node's children, and it is also equal to l("SSÄ"). Consequently l("SSÄ") ≤ L("SSÄ") + ε and l("SÄ") ≤ L("SÄ") + ε (ε is always greater than or equal to zero), and these nodes would be pruned. All leaf nodes that hold several such contexts can therefore be converted efficiently into the shortest context they contain (here the node "SSÄ", which contains the nodes "SSÄ", "SÄ" and "Ä", becomes the node "Ä"). Each inner node that holds several such contexts can be treated as an ordinary node during pruning and expanded into separate nodes after pruning, which makes the tree easier to handle.

Once pruning and expansion have been carried out, the context tree is complete.

Figure 10 shows an example of such a context tree.

The reverse (leaf-to-root) implementation of algorithm B is shown as a block diagram in Figure 11.

Figure 12 shows the implementation of algorithm C as a block diagram.

Example of compression and decompression

According to Equation (4), the probabilities of the characters present in the root node are:

$$p_{\_|\lambda} = \frac{n_{\_|\lambda}}{n_\lambda + q_\lambda} = \frac{2}{23 + 6} = \frac{2}{29}, \quad p_{H|\lambda} = \frac{3}{29}, \quad p_{I|\lambda} = \frac{9}{29}, \quad p_{S|\lambda} = \frac{7}{29}, \quad p_{Ä|\lambda} = \frac{1}{29}, \quad p_{end|\lambda} = \frac{1}{29}$$

(λ denotes the empty string, that is, the root context.) According to Equation (3), the probabilities of the characters E and V, which are missing from the root node, are:

$$p_{E|\lambda} = p_{V|\lambda} = \frac{q_\lambda}{n_\lambda + q_\lambda} \cdot \frac{1}{d - q_\lambda} = \frac{6}{23 + 6} \cdot \frac{1}{8 - 6} = \frac{6}{29} \cdot \frac{1}{2} = \frac{3}{29}$$

The sum of the probabilities of these eight characters should be 1. This can be checked:

$$\frac{2}{29} + \frac{3}{29} + \frac{9}{29} + \frac{7}{29} + \frac{1}{29} + \frac{1}{29} + \frac{3}{29} + \frac{3}{29} = \frac{29}{29} = 1$$

The probabilities of the characters in node "I" are:

$$p_{\_|\text{"I"}} = \frac{n_{\_|\text{"I"}}}{n_{\text{"I"}} + q_{\text{"I"}}} = \frac{2}{9 + 4} = \frac{2}{13}, \quad p_{H|\text{"I"}} = \frac{2}{13}, \quad p_{I|\text{"I"}} = \frac{1}{13}, \quad p_{S|\text{"I"}} = \frac{4}{13}$$

and the probabilities of the characters E, V, Ä and end, which are missing from the context "I", are:

$$p_{E|\text{"I"}} = p_{V|\text{"I"}} = p_{Ä|\text{"I"}} = p_{end|\text{"I"}} = \frac{q_{\text{"I"}}}{n_{\text{"I"}} + q_{\text{"I"}}} \cdot \frac{1}{d - q_{\text{"I"}}} = \frac{4}{9 + 4} \cdot \frac{1}{8 - 4} = \frac{4}{13} \cdot \frac{1}{4} = \frac{1}{13}$$

The sum of these is also 1:

$$\frac{2}{13} + \frac{2}{13} + \frac{1}{13} + \frac{4}{13} + \frac{1}{13} + \frac{1}{13} + \frac{1}{13} + \frac{1}{13} = \frac{13}{13} = 1$$
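The two probability calculations above can be checked with exact fractions. The following is a minimal sketch, assuming Equation (4) has the form n_i / (n + q) for characters present in a node and Equation (3) spreads the remaining mass q / (n + q) evenly over the d − q missing characters, as the worked numbers indicate. The function name and the dictionary layout are illustrative only.

```python
from fractions import Fraction

ALPHABET = ["_", "E", "H", "I", "S", "V", "Ä", "end"]   # d = 8 characters


def context_probabilities(counts, alphabet=ALPHABET):
    """Return {character: probability} for one node, given its counts."""
    n = sum(counts.values())      # n_s: sum of the counts in the node
    q = len(counts)               # q_s: number of distinct characters seen
    d = len(alphabet)             # size of the alphabet
    probs = {}
    for ch in alphabet:
        if ch in counts:
            probs[ch] = Fraction(counts[ch], n + q)      # Equation (4)
        else:
            probs[ch] = Fraction(q, n + q) / (d - q)     # Equation (3)
    return probs


# Counts read off the worked example (root: n = 23, q = 6; "I": n = 9, q = 4).
root = context_probabilities({"_": 2, "H": 3, "I": 9, "S": 7, "Ä": 1, "end": 1})
ctx_i = context_probabilities({"_": 2, "H": 2, "I": 1, "S": 4})

print(root["E"], sum(root.values()))     # 3/29 1
print(ctx_i["Ä"], sum(ctx_i.values()))   # 1/13 1
```

The sketch assumes at least one character is missing from the node (d > q); a node in which every character of the alphabet occurs would need separate handling.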

The character probabilities in any node can be calculated in the same way. The results for some nodes are listed in the following table:

Table 1. Character probabilities in some nodes of the tree.

Character | Context λ | Context "I" | Context "II" | Context "SI" | Context "IIS"
_   | 2/29 | 2/13 | 1/14 | 1/4  | 1/14
E   | 3/29 | 1/13 | 1/14 | 3/40 | 1/14
H   | 3/29 | 2/13 | 1/14 | 1/4  | 1/14
I   | 9/29 | 1/13 | 1/14 | 3/40 | 1/2
S   | 7/29 | 4/13 | 1/2  | 1/8  | 1/14
V   | 3/29 | 1/13 | 1/14 | 3/40 | 1/14
Ä   | 1/29 | 1/13 | 1/14 | 3/40 | 1/14
end | 1/29 | 1/13 | 1/14 | 3/40 | 1/14

Using these probabilities, a Huffman code can be constructed for each of these contexts. The results are shown in the following table:

Table 2. Huffman codes for the characters of Table 1.

Character | Context λ | Context "I" | Context "II" | Context "SI" | Context "IIS"
_   | 0111  | 010  | 100  | 01   | 100
E   | 000   | 0110 | 1010 | 1100 | 1010
H   | 001   | 00   | 1011 | 10   | 1011
I   | 10    | 110  | 1100 | 1101 | 0
S   | 11    | 10   | 0    | 000  | 1100
V   | 010   | 0111 | 1101 | 1110 | 1101
Ä   | 01100 | 1110 | 1110 | 001  | 1110
end | 01101 | 1111 | 1111 | 1111 | 1111

Next, the message "VIISI" is compressed using these code words. To compress the next character of the message, the model tree is searched for the longest string that matches the preceding characters. For example, the context of the character S is the longest of the contexts "VII", "II", "I" and λ that occurs in the model tree. The match found for the character V is empty (λ). The matches found for the characters I, I, S, I and end are λ, "I", "II", "IIS" and "SI", respectively.


The code word for the character V in the context λ is 010. The code word for the character I in the context λ is 10. The code word for the character I in the context "I" is 110, and so on. The compressed string is 010 10 110 0 0 1111.

Decompression is performed correspondingly. The model used is the same as in compression, so the Huffman code table is already known.

For the first character, the context is known to be empty. The first code word (010) is read, and the result is the character V. For the next character, the context is the longest of the contexts "V" and λ that occurs in the model tree. The next code word is now read, and the result is the character I. The context of the following character is the longest of the contexts "VI", "I" and λ that occurs in the model tree. The next code word is then read, again giving the character I, and so on. The compressed string 01010110001111 decodes to the original message "VIISI".
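The compression and decompression walk-through can be traced with a small round-trip sketch. It is a simplified illustration, not the patented implementation: it assumes the model contains only the five contexts tabulated in Table 2 (λ, "I", "II", "SI" and the three-character context "IIS"), whereas the full model tree holds more nodes. The names `CODES`, `longest_context`, `encode` and `decode` are hypothetical.

```python
END = "end"
SYMS = ["_", "E", "H", "I", "S", "V", "Ä", END]

# Huffman code words per context, copied from Table 2 ("" stands for λ).
CODES = {
    "":    dict(zip(SYMS, "0111 000 001 10 11 010 01100 01101".split())),
    "I":   dict(zip(SYMS, "010 0110 00 110 10 0111 1110 1111".split())),
    "II":  dict(zip(SYMS, "100 1010 1011 1100 0 1101 1110 1111".split())),
    "SI":  dict(zip(SYMS, "01 1100 10 1101 000 1110 001 1111".split())),
    "IIS": dict(zip(SYMS, "100 1010 1011 0 1100 1101 1110 1111".split())),
}


def longest_context(history):
    """Longest suffix of the characters seen so far that is a known context."""
    for start in range(len(history) + 1):
        if history[start:] in CODES:
            return history[start:]      # the empty context always matches


def encode(message):
    bits, history = [], ""
    for sym in list(message) + [END]:
        bits.append(CODES[longest_context(history)][sym])
        if sym != END:
            history += sym
    return "".join(bits)


def decode(bits):
    history, pos = "", 0
    while True:
        table = {word: sym for sym, word in CODES[longest_context(history)].items()}
        word = ""
        while word not in table:        # prefix-free: extend until a code word
            word += bits[pos]
            pos += 1
        if table[word] == END:
            return history
        history += table[word]


packed = encode("VIISI")
print(packed)            # 01010110001111
print(decode(packed))    # VIISI
```

Both sides derive the context from characters already processed, which is why no context information needs to be transmitted. The linear suffix search in `longest_context` is what the state machine described below replaces with a constant-time transition.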


For a faster implementation, the model tree is converted into a state machine. Each node (or context) of the tree becomes a state. Each character to be compressed causes a transition to the state of the next character's context (which is known from the characters read so far). The tree (now a state machine) therefore does not have to be traversed every time a character is read, and the context search becomes an O(1) operation.

The compression method described above and the following commonly used compression programs were simulated:

• COMPRESS, which is based on the LZ77 algorithm family of Ziv and Lempel [2].

• PKZIP v2.06, which is based on the LZ78 algorithm family of Ziv and Lempel [2].

• ACE v2.0, developed by Marcel Lemke.

• BZIP2, which is based on the block-sorting algorithm of Burrows and Wheeler [5].

• PPMZ v9.1, developed by Charles Bloom and based on the PPM algorithm family.

It should be noted that the method according to the invention works best when the teaching data used is in the same language as the text to be compressed. This means that when texts in different languages are compressed, a separate state machine corresponding to each language should be used.


The tree machine was built using English-language literature as teaching data. All files were converted to lower case to reduce the size of the alphabet. Figure 5 shows the results of compressing files with the five programs mentioned above and with the preferred method according to this invention described in this specification (referred to here as CORESSION). The described algorithm, CORESSION, compresses small files much better than the other five algorithms.

The average compression and decompression speeds of the six algorithms used in the above experiment were approximately as follows:

• COMPRESS: 600 kilobytes per second.

• PKZIP: 800 kilobytes per second.

• ACE: 100 kilobytes per second.

• BZIP2: 200 kilobytes per second.

• PPMZ: 20 kilobytes per second.

• CORESSION: 400 kilobytes per second with arithmetic coding, or 900 kilobytes per second with Huffman coding.

According to a preferred embodiment of the invention, the data structures used are fixed. In other words, the data structures do not need to be modified during compression or decompression, in contrast to all adaptive compression programs. Consequently, the method according to the present invention can be made at least as fast as any adaptive algorithm whose context and count structures and coding scheme are at least as complex as those of the method according to the present invention.

The memory usage of these six algorithms is approximately as follows:

• COMPRESS: a few hundred kilobytes.

• PKZIP: a few hundred kilobytes.

• ACE: up to 36 megabytes, typically a few megabytes.

• BZIP2: 2 to 6 times the size of the file to be compressed.

• PPMZ: 60 to 90 times the size of the file to be compressed.

• CORESSION: from a few hundred kilobytes to a few megabytes, depending on the size of the tree.


The memory usage of a typical adaptive algorithm grows during compression and decompression, and it depends on the size of the file to be compressed and on the desired compression ratio. With the method according to the present invention, memory usage is constant and does not depend on the file size, which makes the algorithm well suited to environments where memory is limited. In addition, almost all of the memory required can be read-only memory. A device applying the method according to the present invention needs only a few hundred bytes of working memory to compress or decompress files of any size. The memory used by adaptive algorithms must consist entirely of working memory.

Figure 1 shows, as a simplified block diagram, an electronic device 18 in which the method according to a preferred embodiment of the present invention can be applied. The electronic device comprises at least a first input/output block 19 for entering teaching data, for example from a database 23. A control unit 20 is provided for controlling the electronic device 18 and for carrying out the steps of the method according to the present invention. The control unit 20 may comprise one or more processors, such as a microprocessor and/or a digital signal processing unit. The memory means 21 are arranged to store the program code needed for the operation of the control unit, temporary data, data structures, data to be compressed, and so on. The data to be compressed can be read, for example, from the memory means 21, the database 23, the transmission channel 24 and/or the keypad (not shown) of the electronic device 18. The compressed data can be stored in the memory means 21 or the database 23, and/or it can be passed, for example, to the data transmission channel 24 for transmission to a receiving device (not shown), in which the compressed data can be decompressed. The receiving device can also be similar to the electronic device shown in Figure 1. Decompression can be performed using the same kind of data structures (tree machine, finite state machine) as were used in compression.

The pruned tree machine can be converted into a finite state machine in a manner known per se. Compression can generally be performed faster with a finite state machine than with a pruned tree machine.

The present invention can be used in many applications. For example, in a mobile communication environment, text messages can be compressed before they are sent to the mobile network in order to reduce the transmission capacity required. The present invention can also be applied in computers to compress text files, which can then be stored on a storage medium.

It is obvious to a person skilled in the art that the invention is not limited solely to the examples presented above, but that embodiments of the invention may vary within the scope of the following claims.

References

[1] Langdon, G.G., Jr. and Rissanen, J.J., 'Compression of Black-White Images with Arithmetic Coding', IEEE Trans. Communications, Vol. COM-29, No. 6, pp. 858-867, June 1981.

[2] Lempel, A. and Ziv, J., 'Compression of Individual Sequences via Variable-Rate Coding', IEEE Trans. Information Theory, Vol. IT-24, No. 5, pp. 530-536, September 1978.

[3] Weinberger, M.J., Rissanen, J. and Feder, M., 'A Universal Finite Memory Source', IEEE Trans. on Information Theory, Vol. IT-41, No. 3, pp. 643-652, May 1995.

[4] Bell, T.C., Cleary, J.G. and Witten, I.H., 'Text Compression', Prentice Hall, NJ, pp. 140-153, 1990.

[5] Burrows, M. and Wheeler, D., 'A Block-Sorting Lossless Data Compression Algorithm', Digital Equipment Corporation, SRC Research Report 124, 1994.


Claims

A lossless data compression method, in which teaching data comprising a plurality of characters of an alphabet is used to form a tree machine with a certain number of levels; wherein the tree machine comprises a root node at the highest level, nodes at the lower levels, and branches between nodes of two successive levels; for each branch an identifier is determined that corresponds to one character or to several consecutive characters of the teaching data; in the method, a node that has a branch to a node at the next lower level is defined as a parent node, a node that has a parent node is defined as a child node, a node that has no child nodes is defined as a leaf node, and a node that has both a parent node and a child node is defined as an inner node; for each node, a context is determined that corresponds to the string of the branches from the node to the root node; for each node, at least one count is determined that corresponds to the probability distribution of the characters following the context of the node; characterized in that in the method a difference value is determined; for each node other than the root node, a first code length (L(s)) and a second code length (l(s)) are determined; the first code length (L(s)) of a leaf node is determined by the node's own probability distribution; the first code length (L(s)) of an inner node is determined by using at least one of the following: the node's own probability distribution; and/or the probability distribution of at least one child node of the inner node; the second code length (l(s)) of a node is determined by the probability distribution of the node's parent node; wherein, if said second code length (l(s)) is less than or equal to the sum of said first code length (L(s)) and said difference value, the node in question and all its child nodes are removed from the tree machine.

Method according to claim 1, characterized in that a maximum number of nodes is determined, the processing of the first (L(s)) and the second (l(s)) code lengths being repeated until the number of nodes in the tree machine is less than or equal to said maximum number of nodes.

Method according to claim 1 or 2, characterized in that it is used for compressing short alphanumeric data, such as addresses, e-mails, html files, text messages or data blocks.

Method according to claim 1, 2 or 3, characterized in that it comprises the following steps:
- for each node s in the tree, a first code length (L(s)) is calculated as the value $L(s) = \sum_i n_{i|s} \left(-\log_2 p_{i|s}\right)$, using the counts of s and the equation $p_{i|s} = \frac{n_{i|s}}{n_s + q_s}$ for the value $p_{i|s}$,
- in reverse order, starting from the leaves, a value $l(s) = \sum_j L(s_j)$ is calculated in each node s, whereby if $l(s) < L(s) + \varepsilon$, all the nodes $s_j$ and their descendants are removed, otherwise all $s_j$ are left intact, and $L(s) = \min\{l(s), L(s)\}$, and
- the above steps are repeated all the way to the root node.

Method according to any of claims 1-4, characterized in that it comprises the following steps:
- for each leaf node s in the tree, the first code length (L(s)) is calculated as the value $L(s) = \sum_i n_{i|s} \left(-\log_2 p_{i|s}\right)$, using the counts of s and the equation $p_{i|s} = \frac{n_{i|s}}{n_s + q_s}$ for the value $p_{i|s}$; for each inner node s, L(s) = 0,
- in reverse order, starting from the leaves, a value $l(s_j) = \sum_i n_{i|s_j} \left(-\log_2 p_{i|s}\right)$ is calculated in each child node $s_j$, using the counts of the child node ($s_j$) and the probabilities of the parent node (s), whereby the first code length L(s) of the parent node is increased by the value $\min\{l(s_j), L(s_j)\}$; if $l(s_j) \le L(s_j) + \varepsilon$, the child node $s_j$ and all its descendants are pruned, otherwise all $s_j$ are left intact, and
- the above steps are repeated all the way to the root.

Method according to any of claims 1-5, characterized in that in order to allow a maximum tree to grow from teaching data, the teaching data is stored in a memory only partially, whereby very large alphabets and teaching data can be processed.

Method according to any of claims 1-6, characterized in that the data to be compressed is encoded with a Huffman code in the tree machine.

Method according to any one of claims 1-6, characterized in that the data to be compressed is encoded with an arithmetic code in the tree machine.

A device (18) for lossless compression of data, comprising means (19) for receiving teaching data, which comprises a plurality of characters of an alphabet, and means (20, 21) for forming a tree machine comprising a certain number of levels using the teaching data; wherein the tree machine comprises a root node at the highest level, nodes at the lower levels, and branches between nodes at two successive levels; in the tree machine, a node with a branch to a node at the next lower level is defined as a parent node, a node with a parent node is defined as a child node, a node with no child nodes is defined as a leaf node, and a node with both a parent node and a child node is defined as an inner node; wherein said means (20, 21) for forming a tree machine comprise: means for determining a context for each node, which context corresponds to the string of branches from the node to the root node; and means for determining at least one count for each node, which count corresponds to the probability distribution of the characters following the node's context; characterized in that the device (18) further comprises: means (20) for determining a difference value; means (20) for determining a first code length (L(s)) for each node; and means (20) for determining a second code length (l(s)) for each node other than the root node; wherein the first code length (L(s)) of a leaf node is arranged to be determined by the node's own probability distribution; the first code length (L(s)) of an inner node is arranged to be determined using at least one of the following: i. the node's own probability distribution; ii. and/or the probability distribution of at least one child node of the inner node; the second code length (l(s)) of a node is arranged to be determined by means of the probability distribution of the node's parent node; and means (20) for examining whether said second code length (l(s)) is less than or equal to the sum of said first code length (L(s)) and said difference value, in which case said node and all its child nodes are arranged to be removed from the tree machine.

Device according to claim 9, characterized in that it is a general computer comprising means (19) for receiving teaching data from a database.

Device according to claim 9, characterized in that it is a pen computer, a mobile telephone or a gaming machine.

Device according to claim 9, characterized in that the data to be compressed is a text message.