GB2595002A - Lossless compression of sorted data - Google Patents

Lossless compression of sorted data

Info

Publication number
GB2595002A
GB2595002A (application GB2007278.1A)
Authority
GB
United Kingdom
Prior art keywords
characters
character
compression
columns
column
Prior art date
Legal status
Granted
Application number
GB2007278.1A
Other versions
GB202007278D0 (en)
GB2595002B (en)
Inventor
Nercessian Andy
Current Assignee
Maymask 171 Ltd
Original Assignee
Maymask 171 Ltd
Priority date
Filing date
Publication date
Application filed by Maymask 171 Ltd
Priority to GB2007278.1A
Publication of GB202007278D0
Publication of GB2595002A
Application granted
Publication of GB2595002B
Legal status: Active


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L 69/04 Protocols for data compression, e.g. ROHC
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M 7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M 7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M 7/70 Type of the data to be coded, other than image and sound
    • H03M 7/707 Structured documents, e.g. XML
    • H03M 7/3068 Precoding preceding compression, e.g. Burrows-Wheeler transformation
    • H03M 7/3077 Sorting
    • H03M 7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
    • H03M 7/4006 Conversion to or from arithmetic code
    • H03M 7/4012 Binary arithmetic codes
    • H03M 7/4018 Context adaptive binary arithmetic codes [CABAC]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention compresses strings (e.g. Alex, Alice, Aline, Avery, Bev) of symbols/characters/letters taken from finite alphabets (e.g. the Roman alphabet). The strings are arranged in a matrix grid and sorted, probably into alphabetic/lexicographic order. The invention then operates on characters one column at a time (i.e. all the first characters [AAAAB], all the second characters [lllve], etc.) which follows the direction of sorting. The embodiment identifies blocks in the second and subsequent columns where the characters in preceding columns are identical, and thus the letters are subject to alphabetic ordering (e.g. the 3 rows starting Alex, Alice, Aline). It then calculates character probabilities based on the alphabetic constraint (i.e. letters earlier in the alphabet are ignored) and uses them for compression/encoding, probably arithmetic encoding. The invention may also determine when this approach is unlikely to be efficient and switch to alternative encoding schemes.

Description

Lossless Compression of Sorted Data [1] The present invention relates to compression of data in electronics and telecommunication systems. In particular, the present invention relates to lossless compression of sorted data: a method and a system are disclosed for compressing a set of data comprising columns whose sort order does not matter or whose sort order is alphabetical in some sense.
[2] The present invention resolves limitations in known data compression systems and methods. The compression of data requires a method and a system that allows a user to determine a balance between compression ratio and performance.
Background of the invention
[3] The proposal is a lossless method of compression suitable for one or more columns of data whose sort order does not matter or whose sort order is alphabetical in some sense. It is a form of context-based arithmetic coding which achieves its compression ratio by a highly skewed probability distribution model. Its application is quite wide and includes a dictionary and a phonebook using a natural language alphabet or certain types of tables in a database using an alphabet comprised of entries selected from a finite number of possibilities. The method includes a slider system that allows the user to determine the best balance between compression ratio and performance.
Statement of invention
[4] According to the invention, there is provided a method and a system of lossless compression of sorted data that proves useful in applications with access to sufficient processor power and determines a suitable column number for the compression method that is switchable to allow an implementer to optimise compression ratios, in accordance with the features of claim 1.
[5] According to a first aspect of the present invention, there is provided a method of lossless compression of sorted data, including: compressing, at one or more processors, a sorted list of strings formed of characters from a finite alphabet, each string forming a row of a plurality of rows of a matrix configuration, and each consecutive character occupying a respective consecutive column of a plurality of columns of the matrix configuration; moving, by the one or more processors, down each column in a sequence; and for each character identified as occupying a block of characters having identical rows in previous columns relative to traversing a predetermined column of the plurality of columns, establishing, by the one or more processors, a probability distribution for an identity of a next character of the characters during a traversal and a screening of the characters.
[6] According to an embodiment, the compressing of the sorted list is configured by a context-based arithmetic coding that achieves a compression ratio by a highly skewed probability distribution model.
[007] According to an embodiment, the method further includes determining a suitable column number for the compression method that allows an implementer of the method to optimise all compression ratios.
[8] According to an embodiment, the method further comprises compressing a first few columns of the matrix configuration and delaying compression of remaining columns.
[9] According to another embodiment, the compressing includes determining whether a better compression ratio can be attained.
[10] According to yet another embodiment, the method further includes defining at least two thresholds: one designating a minimum acceptable size of a number of blocks to be compressed; and one designating a number of blocks that is preferred over a minimum acceptable size to be compressed.
[011] According to another embodiment, the method further includes, during an encoding process, compressing words of variable widths.
[12] According to another embodiment, the sorted list comprises elements of a standard computer system keyboard and keyboard applications on a graphical user interface.
[13] According to yet another embodiment, the sorted list further comprises a dictionary and a phonebook, both using a natural language alphabet or a type of tables in a database using an alphabet that comprises entries selected from a finite number of possibilities.
[014] According to another embodiment, the method includes padding each name with spaces so that all names have a same width as a longest name.
[015] According to another embodiment, a resulting width and a number of rows are noted, followed by flattening a list to be compressed by moving consecutively across the columns of the matrix configuration.
[016] According to another embodiment, the method further includes encoding using arithmetic coding with a set of probabilities for each character depending on characters in a vicinity of a first predetermined character to be compressed.
[017] According to yet another embodiment, the probabilities are determined and adjusted for arithmetic coding.
[018] According to another embodiment, the method further includes adding a space or a padding character as a preceding character or a succeeding character.
[019] According to a second aspect of the present invention, there is provided a system of lossless compression of sorted data, the system including one or more processors that further comprise at least one memory, at least one decoder and at least one encoder, the memory receiving an input for compression through the decoder and the encoder, the system configured to perform operations that include: compressing, using the input for compression, a sorted list of strings formed of characters from a finite alphabet, each string forming a row of a plurality of rows of a matrix configuration, and each consecutive character occupying a respective consecutive column of a plurality of columns of the matrix configuration; moving, using the input for compression, down each column in a sequence; and for each character identified as occupying a block of characters having identical rows in previous columns relative to traversing a predetermined column of the plurality of columns, establishing, using the input for compression, a probability distribution for an identity of a next character of the characters during a traversal and a screening of the characters.
[020] According to a third aspect of the present invention, there is provided a non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor, cause the processor to perform operations including: compressing a sorted list of strings formed of characters from a finite alphabet, each string forming a row of a plurality of rows of a matrix configuration, and each consecutive character occupying a respective consecutive column of a plurality of columns of the matrix configuration; moving down each column in a sequence; and for each character identified as occupying a block of characters having identical rows in previous columns relative to traversing a predetermined column of the plurality of columns, establishing a probability distribution for an identity of a next character of the characters during a traversal and a screening of the characters.
Brief description of drawings
[021] The invention will be described by way of example only, with reference to the drawings:

[022] Figure 1 depicts an example of an identified k-block according to the invention;

[023] Figure 2 depicts an encoding process according to the invention;

[024] Figure 3 depicts a decoding process according to the invention.

Overview

[025] A list of alphabetical words can be compressed efficiently because, assuming equal likelihood of appearance of each letter before the list is sorted into alphabetical order, we can work out the probabilities of the appearance of a letter in a given position in a given word (itself in a given position within the entire list). We first turn the list of words into a table by padding with a special 'space' character until every word is as long as the longest word in our list. We then place each word in its own row and each letter of a given word in its own column. If we then flatten the list by going down each row before moving onto the next column, we find that we are able to work out the probability of the next letter using only information we have collected so far, i.e. without recourse to any 'future' information except the total table height and width. This means that both encoder and decoder can work out the probability distribution for the next letter using the same information and thus build the probability list required for arithmetic coding for that particular position in our flattened list.
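As an illustration of the padding-and-flattening step described above, here is a minimal Python sketch (the function name and the use of a trailing-space pad are ours; the patent describes the transformation, not the code):

```python
def flatten(words):
    """Pad every word with spaces to the width of the longest word,
    then read the resulting table column by column."""
    width = max(len(word) for word in words)
    padded = [word.ljust(width) for word in sorted(words)]  # rows of the table
    # Column-major traversal: all first letters, then all second letters, ...
    return "".join(padded[r][c] for c in range(width) for r in range(len(padded)))

# flatten(["Alex", "Alice", "Aline", "Avery", "Bev"])
# -> 'AAAABlllveeinevxcer  e y '
```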
[26] This technique faces two obstacles. The first is performance, because the probability calculations require working with factorials that can get arbitrarily large. The second is that as we move through our flattened list, each new column results in a deterioration of compression ratio. There are many available methods for dealing with the former, but a simple solution which may prove adequate in applications with access to sufficient processor power is offered. As for the latter, the solution offered is predicated on the idea that the best results can be achieved by using the current method up to whichever column offers a compression ratio superior to the alternatives, and an alternative method of compression for the remainder. Determining a suitable column number for the compression method switch allows the implementer to optimise the compression ratios of the method.
Working out the probabilities of each letter in a given position

[027] Consider the following column of data from a personal phonebook:

[028] Alex
[029] Alice
[030] Aline
[031] Avery
[032] Bev
[033] ...
[034] ...
[035] Zach

[036] We begin by padding each name with spaces so that all names have the same width as the longest name. Then we note the resulting width (hereafter referred to as w) and the number of rows (hereafter referred to as R) and proceed by flattening our list by going down the first column followed by the second, and so forth, to get:

[037] AAAAB...Z lllve...a einev...c ...
[38] We now encode using arithmetic coding with the set of probabilities for each character depending on characters in the vicinity of our character. Arithmetic coding enjoys many varieties, but in the current paper we will assume that the variety described as arithmetic coding using integers in Chapter 4 of Sayood is to be employed. To describe the determination of the probabilities we adopt the following notation:

n: Alphabet size including the space/padding character (or alternatively, an end-of-word character).
l_c: The ordinal value of the current letter (the letter being encoded during encoding and the letter we are looking for in decoding).
l_p: The ordinal value of the previous letter in the flattened list.
k-block: A block of rows for which the name truncated at (but excluding) the column number of the current letter remains the same in the vicinity of the current letter.
k: The number of rows in a k-block.
k_t: The number of k-blocks above a predetermined threshold t.
r: The current row number within the block defined by k, from 1 to k.
R: The total number of rows in the entire table.
r_t: The current row number within the entire list to be encoded, from 1 to R.
w: The number of characters in the longest word.
[039] Fig 1 provides an example of these designations. As we move down the third column, the block size k can be seen to be 3 because the first two letters of each of the words on the left are the same for 3 consecutive rows. Our current position is the third letter of 'Alice', whose ordinal value is designated by l_c, while the letter immediately above has an ordinal value designated by l_p. With a = 1, b = 2, ..., z = 26, this gives us l_c = ordinal value of i = 9 and l_p = ordinal value of e = 5. Also, our present position has r = 2 since we are in position 2 of the k-block of 3. Note that if l_c is in the first column, k = R and r = r_t.

[040] As for our alphabet, we have two options. Either we add the space/padding character and treat that as a character preceding 'a', or we can introduce an end-of-word character which appears at the end of each word. Both have the effect of reducing slightly the probability range of each of the other characters, since in both cases we have to add one to the number of characters in our source. This is necessary unless we are dealing with a fixed width source, such as is possible if our source is a series of numbers (because these can be padded on the left with a 0, which is one of the characters in the alphabet of numbers). The main advantage of using the end-of-word character is that both encoder and decoder can simply 'move on' if they find this character to their left and save both space and processor time.
[41] These two are our only options if we intend to compress the entire dataset with our proposed method. However, most applications will benefit from switching to an alternative compression method at a specific column determined by factors outlined in the section Optimising Compression Ratio. In this case, the likelihood of needing an end-of-word or space character is small.
[42] To determine the probability of the current letter being an 'i' we note that the previous letter is an 'i', so we can exclude the letters 'a' through 'h', but any letter between the 'i' of the previous letter and 'z' is a possibility. We would like to know the probability that the current letter will be 'i', the probability that it will be a 'j', and so forth. In other words, we would like to know the probability distribution of our shortened alphabet (i-z).
[43] Within our k-block, if we are in row r, then it is clear that we need to arrange n - l_p + 1 letters in k - r + 1 slots. Since only one sort order is acceptable, we treat all different arrangements of the same letters as the same and, noting that each letter can appear as many times as it pleases, the total number of possible arrangements is $\binom{n-l_p+1+k-r}{k-r+1}$. Of these arrangements, the number of arrangements which begin with the letter l_c is $\binom{n-l_c+k-r}{k-r}$.

[44] Hence,

$$\mathrm{prob}(l_c \mid n, l_p, k, r) = \binom{n-l_c+k-r}{k-r} \Bigg/ \binom{n-l_p+1+k-r}{k-r+1} \qquad (1)$$

See, for example, Chapter 1 from Shiryaev, 2016 for why this is the case.
[45] Since encoder and decoder both have the values of n, l_p, r and k available, they can work out the probabilities of each letter, distribute them in alphabetical order and arrive at the same conclusions about the correspondence between letter and probability range. It is a routine matter to use the well-known principles of arithmetic coding to implement this probability model. Fig 2 is a summary version of the program flow.
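The probability model of equation (1) is small enough to state directly in code. The following Python sketch (function and argument names are ours, not the patent's) computes the probability the shared encoder/decoder model assigns to a candidate current letter:

```python
from math import comb

def prob(lc, n, lp, k, r):
    """Equation (1): probability of current letter lc given alphabet size n,
    previous letter lp, k-block size k and row r within the block."""
    M = k - r                      # rows still to fill below the current one
    if lc < lp:                    # sorted order rules out earlier letters
        return 0.0
    return comb(n - lc + M, M) / comb(n - lp + 1 + M, M + 1)

# Fig 1 example: third column of 'Alice' inside the 3-row 'Al...' block,
# with n = 27 (26 letters plus the padding character), lp = 5 ('e'), r = 2:
# prob(9, 27, 5, 3, 2) ~= 0.0688 is the model's probability of an 'i'.
```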
Encoding and Decoding

[46] The method here proposed has two separate flavours. The first assumes that we will only compress the first few columns, which can be compressed more efficiently than using an alternative compression mechanism, and send the remainder to an alternative compressor. To do this, we need to keep track of the size of k-blocks in the current column so that before we start processing the next column, we can make a decision as to whether there are a sufficient number of 'large enough' k-blocks to allow us to achieve a better compression ratio than the alternative. For that, we define two thresholds in advance: the first, which we designate t, is the minimum acceptable size of the k-block, and the second, which we designate T, is the number of k-blocks that we would like above this minimum acceptable size in order to continue. The variable k_t holds the required information, that is, the number of k-blocks that are larger than t, and at the end of each column processed, we check that k_t > T in order to decide whether we are to continue.
[47] The second flavour allows us to compress words of variable width and in the description of the encoding below this issue is briefly addressed.
[48] The encoding begins by carrying out the above mentioned 'flattening' and, for all words shorter than w, adding an end-of-word character. This allows us to encode R and w as fixed length binaries. If R < 2^16 we use the first 16 bits to represent it, followed by a flag bit which tells the decoder that no more bits will be required for R. If more are required, repeat this process until R is fully encoded. Carry out a similar procedure with w. After this, initialise as follows:

[49] n = 27, k = R, r = 1, l_p = 1

[50] Now we receive our first letter and determine a list of probabilities for each letter using the probability formulas above, encoding as per the variety of arithmetic coding of our choice. Next we check to see whether r = k. If not, we increment r by 1, assign l_p = l_c, and repeat the above determination of probabilities, etc. Otherwise, set r = 1 and l_p = 1 and check to see whether r_t = R, i.e. that we are at the bottom row of our current column. If we are not, carry out the procedure of working out k. If k > t then increment k_t by 1 (see the Optimising Compression Ratio section for more on this). If we are at the bottom of the current column, check to see whether our current column is the last one. If so, stop. Otherwise we need to determine whether we are to continue by looking at the value of k_t. If k_t > T we increment the current column number, set r_t = 1, k_t = 0 and carry out the procedure of working out k. If k > t we set k_t = 1. If k_t <= T, then we are likely to get better results by using an alternative compressor, so we send the remaining columns to it and then stop.
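The variable-length header for R and w in [48] can be sketched directly; the chunk-plus-flag layout follows the description, while the function name and bit order are our own assumptions:

```python
def encode_length(value, chunk=16):
    """Emit `chunk` bits of value, then a flag bit: 0 if the value fit,
    1 if another chunk follows (as described in [48] for R and w)."""
    bits = []
    while True:
        bits.extend((value >> i) & 1 for i in reversed(range(chunk)))
        value >>= chunk
        more = 1 if value else 0
        bits.append(more)
        if not more:
            return bits

# len(encode_length(100))   -> 17 bits: R < 2**16 needs one chunk plus a flag
# len(encode_length(70000)) -> 34 bits: two 16-bit chunks plus two flags
```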
[51] If we are not using a fixed width system, then we additionally need to check whether a letter exists at all. In this case, we must have end-of-word characters added to the end of each prematurely ending word. Then if a letter does not exist, we check to see whether an end-of-word character exists to our left. If it does, then we do nothing and move on to the next letter. If it doesn't, then we set l_c to the ordinal value of our end-of-word character. If a letter does exist, we determine its ordinal value and work out its probability range using the current values of k, r and l_p and (2).
[52] The procedure for working out k is as follows. Begin with k = 1. Note the current name up to but excluding the current letter and move down the list of names until it is different, incrementing k as you move down.
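A direct Python rendering of this k computation, together with the k_t/T column test from [46] and [50], might look as follows (a sketch under our naming; the patent specifies the logic, not the code):

```python
def block_size(names, row, col):
    """k: count consecutive rows whose names agree up to, but excluding,
    the current column (begin with k = 1 and move down while they match)."""
    prefix, k = names[row][:col], 1
    while row + k < len(names) and names[row + k][:col] == prefix:
        k += 1
    return k

def continue_with_column(names, col, t, T):
    """The k_t > T test: are there enough k-blocks larger than t in this
    column to keep using the method rather than the alternative compressor?"""
    kt, row = 0, 0
    while row < len(names):
        k = block_size(names, row, col)
        if k > t:
            kt += 1
        row += k
    return kt > T
```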
[053] The decoding procedure should be quite clear from the encoding. Since n is constant, and we are given R and w, to find l_c we need to have only k, r and l_p, all of which we can work out by using only the text received so far. In other words, we mirror the encoding procedure. The procedure is completed when r_t = wR.
Improving Performance

[054] As it stands, this procedure is too expensive to be practical in all but the smallest files. The main burden on the processor is the calculation of the probabilities. As factorials get very large very quickly, the fact that there is no obvious limit to the difference between k and r (indeed we begin with r = 1, and our initial k is the number of entries in our phonebook, which can potentially be in the millions) weighs on the processor. To address this, we need to rewrite the ratio between our numerator and our denominator, which is easy to do. After some rearrangement we get the required ratio as

$$\mathrm{prob}(l_c \mid n, l_p, k, r) = \frac{(M+1)(n-l_p)(n-l_p-1)\cdots(n-l_c+1)}{(M+n-l_p+1)(M+n-l_p)(M+n-l_p-1)\cdots(M+n-l_c+1)} \qquad (2)$$

[055] where M = k - r.

[056] Now assume that we work only with integers for our arithmetic coding and use a total range of, say, 1024, so that our probability distribution looks something like the below (with l_p = ord(D)):

D: 0-215
E: 216-330
...
Z: 1020-1023

[057] Then if with the normalised range (hereafter nr) of 1024 we end up with probability ranges smaller than 1, we will have to make the necessary adjustments to make sure every member of the alphabet between l_p and n receives its due range.

[058] But (2) makes clear that as k - r grows, the first factor, i.e. (M+1)/(M+n-l_p+1), approaches 1 while the remaining factors get increasingly smaller, so we can be guaranteed a value of k - r above which we will fall below the minimum range required to work with integers. As normalised ranges are necessary to work with arithmetic coding in integers, there is no choice but to lose some accuracy in the probability values, but as we are going to lose that precision there is little point in increasing the value of k - r above that range.

[059] It is not difficult to see that if M > kr_max, where

$$kr_{max} = \sqrt[l_c-l_p]{nr\,(n-l_p)(n-l_p-1)\cdots(n-l_c+1)},$$

then the resulting probability range will be smaller than 1 with total range nr. Indeed,

$$\frac{(n-l_p)(n-l_p-1)\cdots(n-l_c+1)}{(M+n-l_p)(M+n-l_p-1)\cdots(M+n-l_c+1)} < \frac{(n-l_p)(n-l_p-1)\cdots(n-l_c+1)}{M^{\,l_c-l_p}} < \frac{1}{nr}.$$

[060] So wherever k - r > kr_max we can replace the expression k - r in the probability formula with kr_max without loss of precision.
[61] This on its own would hardly suffice to improve our performance, because working out roots is itself a very expensive process. However, kr_max is a function only of l_p and l_c, and since the normalised range and alphabet size will be known in advance of the process, a lookup table of size n² can be prepared in advance so that the calculation is simply replaced by a 2-dimensional lookup that can reside in memory.
[062] These considerations also furnish us with a mechanism for giving fine-grained control over the balance between performance and compression ratio. The application can determine a user-defined kr_max value such that the probabilities can be calculated using min(user-defined kr_max, kr_max). Thus increasing the user-defined kr_max results in an increase of compression ratio, while a decrease results in an improvement in performance.
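A sketch of this capping scheme in Python follows. The kr_max formula is our reading of the derivation in [059], and the table layout (a dictionary keyed on the pair l_p, l_c) is illustrative:

```python
from math import comb

def build_kr_max_table(n, nr):
    """Precompute kr_max for every pair lp < lc, so the root is never
    taken during encoding (the n-by-n lookup table suggested in [61])."""
    table = {}
    for lp in range(1, n + 1):
        for lc in range(lp + 1, n + 1):
            product = nr
            for v in range(n - lc + 1, n - lp + 1):
                product *= v                      # nr * (n-lp)(n-lp-1)...(n-lc+1)
            # (lc - lp)-th root, rounded up a little to stay on the safe side
            table[(lp, lc)] = int(round(product ** (1.0 / (lc - lp)))) + 1
    return table

def capped_prob(lc, n, lp, k, r, table, user_kr_max=None):
    """Equation (1)/(2) value with M = k - r capped at kr_max, and optionally
    at a user-defined value trading compression ratio for speed ([062])."""
    M = k - r
    if lc > lp:
        cap = table[(lp, lc)]
        if user_kr_max is not None:
            cap = min(cap, user_kr_max)
        M = min(M, cap)
    return comb(n - lc + M, M) / comb(n - lp + 1 + M, M + 1)
```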
Optimising Compression Ratio

[063] Our proposed method has no ability to compress where there is no alphabetical order, and clearly after the first few columns the maximum size of k-blocks (hereafter k_max) quickly approaches 1. When for a given column k_max = 1, all subsequent columns will also have k_max = 1 and so, at the very least, we must switch to a different compression method once we have established that the next column satisfies this condition. The likelihood of getting better compression ratios by switching before k_max = 1 is of course very high, and just how much before we ought to switch depends on the alternative compression method. And since this will depend on the application at hand, we cannot provide any definitive answer.
[064] However, we can determine the probability of having at the very least 1 k-block of size 2, which is the bare minimum condition that ought to be satisfied for any compression to be possible at all using our model. The benefit of doing this is that a simple calculation carried out only once at the start of the process can act as a guide for when to switch to an alternative compression mechanism. The simplest way of doing this is to consider the probability of having any two identical words to the left when we are in column w + 1 (so that the words to the left have w characters). With an alphabet size of n we have a total of n^w possible words to the left, and each of our R positions can be occupied by any one of them, giving us a total of n^{wR} possibilities. To find all possible combinations of unique words, the first word may be any of the n^w possible words, the second can be any of n^w - 1 words, the third any of n^w - 2 words, and so forth. Since we have R words, that gives us

$$(n^w)_R = (n^w)(n^w - 1)\cdots(n^w - R + 1)$$

combinations of unique words. The probability that all words to the left will be unique is thus $(n^w)_R / n^{wR}$, and the probability of at least one k-block of size 2 is

$$1 - \frac{(n^w)_R}{n^{wR}}.$$

We note that since $(n^w)_R = 0$ whenever $R > n^w$, we are then guaranteed the existence of at least 1 k-block of size 2. Thus, using the English alphabet case-insensitively, the fourth column is guaranteed some potential for compression if we have more than 26³ = 17,576 words, and if we are compressing numbers we would require 10³ = 1,000 rows of data.
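This once-only estimate is cheap to compute exactly with rational arithmetic. A Python sketch (names are ours):

```python
from fractions import Fraction

def prob_kblock_of_two(n, w, R):
    """1 - (n^w)_R / n^(wR): probability that at least two of the R words of
    width w over an alphabet of size n coincide, i.e. that some k-block of
    size >= 2 exists in column w + 1."""
    total = n ** w
    if R > total:
        return Fraction(1)            # pigeonhole: a duplicate is guaranteed
    p_all_unique = Fraction(1)
    for i in range(R):
        p_all_unique *= Fraction(total - i, total)
    return 1 - p_all_unique

# float(prob_kblock_of_two(26, 3, 100)) ~= 0.25: with only 100 rows, the
# fourth column of a case-insensitive English list has roughly a one-in-four
# chance of containing a k-block of size 2.
```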
[65] Unfortunately, knowing the probability of at least one k-block of size 2 is only a very rough guide and of limited use in any applications in which alternative compression methods can be applied with better compression ratios for block sizes of 2. And generalising this to k-blocks of size greater than 2 requires a calculation that is prohibitively expensive, though for a given application with a predetermined alphabet and a range of likely input file sizes known in advance, lookup tables may offer a potential solution.
[66] Given these difficulties, a much simpler alternative is hereby proposed. As the encoder moves down a given k-block it counts the number of repetitions of each letter, and if this number passes a certain threshold a sufficient number of times, the decision is made to encode the next column; otherwise the next column is passed on to the alternative compression method. The decoder does the same as it decodes each letter, so there is no need for this information to be sent by the encoder.
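A minimal sketch of this run-counting switch follows (we ignore, for brevity, that a run should also lie inside a single k-block of the current column; names and threshold semantics are illustrative):

```python
from itertools import groupby

def encode_next_column(column_letters, t, T):
    """Count runs of repeated letters in the current column; each run longer
    than t signals a k-block worth compressing in the next column."""
    long_runs = sum(1 for _, run in groupby(column_letters)
                    if sum(1 for _ in run) > t)
    return long_runs > T

# encode_next_column("AAAABZ", t=2, T=0) -> True: the run of four As
# promises a usable k-block in the next column.
```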
[067] The question that remains is what this threshold ought to be, i.e. what is the optimum minimum acceptable size of a k-block? For this, we consider the entropy of a k-block for a given alphabet size n as a function of r, i.e. the function

$$f(r) = -\sum_{l_c=1}^{n} P(l_c \mid r)\,\log_2 P(l_c \mid r)$$

in order to collect sufficient information to make an informed estimate of the number of bits required to encode the k-block, which we can then compare with the corresponding figure achievable using the alternative method. We therefore proceed to investigate entropy in a block of size k.²

[068] Clearly, entropy is not in general constant as we move down a column in a k-block. While the entropy of the first row can be determined since l_p = 1 for the first row, the entropy of the second row will depend on the letter of the first row, which we cannot know in advance. It is nevertheless easy to see that it can increase or decrease. To see that it can decrease, imagine the extreme situation in which the first row receives a Z. The probability of Z in the second row and every subsequent row is then 1 and the probability of every other letter is 0. This gives us an entropy of 0, which is guaranteed to be lower than the entropy of the first row for all n > 1. Conversely, to see that it can increase, note that if l_p remains stubbornly at 1 as we move down the k-block, i.e. we get to the bottom row of our k-block and our previous letter is still only an A, then the probability of each letter in the last row is the same, and we have maximum entropy for this last row. Indeed the numerator from (1) is then $\binom{n-l_c}{0} = 1$ for every l_c, and as l_p = 1 no letter is excluded.
[069] So there is no meaningful minimum or maximum entropy for a k-block. However, it is not difficult to work out the a priori probability for a given letter in a given row of a k-block, and by establishing the entropy of each row we can assign an average to each k-block. Since the probability of a particular row depends only on l_p, i.e. the outcome of the previous row, we can use the standardised language of Markov chains, noting that because the transition matrices change as we progress, the chain formed is nonhomogeneous.³

² Entropy is a very good guide to the number of bits required to compress each letter. For more on how close arithmetic coding can get to entropy, see Section 2.14 on Arithmetic Coding in Salomon, 2000.
³ For more on nonhomogeneous Markov chains, see, for example, Chapter 7 of Iosifescu, 1980.
[070] Write I to denote the 1 × n matrix of initial probabilities, i.e. the probabilities of each l_c for the first row:

$$I = \begin{bmatrix} P(l_c = 1 \mid l_p = 1, r = 1) & P(l_c = 2 \mid l_p = 1, r = 1) & \cdots & P(l_c = 26 \mid l_p = 1, r = 1) \end{bmatrix}$$

Write $m_k^{(p)}$ for the n × n transition matrix between the (p-1)st row and the pth row of a k-block, where the ith row represents the probabilities of each l_c given l_p = i and r = p for that k-block. Then

$$m_k^{(p)} = \begin{bmatrix} P(l_c = 1 \mid l_p = 1, r = p) & P(l_c = 2 \mid l_p = 1, r = p) & \cdots & P(l_c = 26 \mid l_p = 1, r = p) \\ 0 & P(l_c = 2 \mid l_p = 2, r = p) & \cdots & P(l_c = 26 \mid l_p = 2, r = p) \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P(l_c = 26 \mid l_p = 26, r = p) \end{bmatrix}$$

where the value of each entry can be calculated using (1) and the values of l_c, l_p, k and r.

[071] The a priori set of probabilities for row r is then given by the 1 × n matrix

$$P_k^{(r)} = \begin{bmatrix} p_1 & p_2 & \cdots & p_n \end{bmatrix} = I\, m_k^{(2)} m_k^{(3)} \cdots m_k^{(r)}$$

and the entropy of the rth row is given by $-\sum_{i=1}^{n} p_i \log_2 p_i$, where the $p_i$ are the entries of $P_k^{(r)}$. Since we now have the a priori entropy for each row, calculating the average entropy of a k-block for a given value of n is a straightforward procedure.
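The row distributions and entropies behind Tables 1-4 can be reproduced with a few lines of Python: start from I (the first row, where l_p = 1) and push the distribution through the transition matrices. A sketch, restating prob from the earlier example:

```python
from math import comb, log2

def prob(lc, n, lp, k, r):            # equation (1), as before
    M = k - r
    return 0.0 if lc < lp else comb(n - lc + M, M) / comb(n - lp + 1 + M, M + 1)

def row_entropies(n, k):
    """A priori entropy of each row of a k-block: P_k(r) = I m_k(2)...m_k(r)."""
    dist = [prob(lc, n, 1, k, 1) for lc in range(1, n + 1)]     # the matrix I
    entropies = []
    for p in range(1, k + 1):
        if p > 1:                     # multiply by the transition matrix m_k(p)
            dist = [sum(dist[lp - 1] * prob(lc, n, lp, k, p)
                        for lp in range(1, lc + 1)) for lc in range(1, n + 1)]
        entropies.append(-sum(q * log2(q) for q in dist if q))
    return entropies

# row_entropies(10, 10) -> [1.798, 2.423, 2.738, 2.902, 2.973, ...] (Table 1);
# their mean is about 2.57, the figure quoted below Table 1.
```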
Table 1: Entropy by row number
A Priori Probability Distribution for n = 10 (letters A-J)

Row  A      B      C      D      E      F      G      H      I      J      Entropy
1    0.526  0.263  0.124  0.054  0.022  0.008  0.002  0.001  0.000  0.000  1.7981
2    0.263  0.279  0.209  0.130  0.070  0.032  0.013  0.004  0.001  0.000  2.4229
3    0.124  0.209  0.223  0.186  0.129  0.075  0.036  0.014  0.004  0.001  2.7377
4    0.054  0.130  0.186  0.200  0.175  0.127  0.076  0.036  0.013  0.002  2.9019
5    0.022  0.070  0.129  0.175  0.191  0.172  0.127  0.075  0.032  0.008  2.9735
6    0.008  0.032  0.075  0.127  0.172  0.191  0.175  0.129  0.070  0.022  2.9735
7    0.002  0.013  0.036  0.076  0.127  0.175  0.200  0.186  0.130  0.054  2.9019
8    0.001  0.004  0.014  0.036  0.075  0.129  0.186  0.223  0.209  0.124  2.7377
9    0.000  0.001  0.004  0.013  0.032  0.070  0.130  0.209  0.279  0.263  2.4229
10   0.000  0.000  0.001  0.002  0.008  0.022  0.054  0.124  0.263  0.526  1.7981

[073] Table 1 shows entropy values for each row of a k-block of size 10, for an alphabet of size 10. The average entropy is 2.57, so we can expect, with our arithmetic coding, to achieve close to 2.57 bits per letter. In uncompressed form, we would require 4 bits per letter given that we have 10 letters (though of course we could manage as many as 16 letters with 4 bits), but what matters to our method is the average bits per letter that our alternative compressor could achieve.
Table 2: Entropy of first row by k-block size

Probability distribution for first row of k-block

k    A      B      C      D      E      F      G      H      I      J      Entropy
5    0.357  0.247  0.165  0.105  0.063  0.035  0.017  0.007  0.002  0.000  2.4012
10   0.526  0.263  0.124  0.054  0.022  0.008  0.002  0.001  0.000  0.000  1.7981
15   0.625  0.245  0.089  0.030  0.009  0.002  0.001  0.000  0.000  0.000  1.4698
20   0.690  0.222  0.066  0.018  0.004  0.001  0.000  0.000  0.000  0.000  1.2572
25   0.735  0.201  0.050  0.011  0.002  0.000  0.000  0.000  0.000  0.000  1.1059
30   0.769  0.182  0.039  0.008  0.001  0.000  0.000  0.000  0.000  0.000  0.9917
35   0.795  0.166  0.032  0.005  0.001  0.000  0.000  0.000  0.000  0.000  0.9018
40   0.816  0.153  0.026  0.004  0.001  0.000  0.000  0.000  0.000  0.000  0.8289
45   0.833  0.142  0.022  0.003  0.000  0.000  0.000  0.000  0.000  0.000  0.7684
50   0.847  0.132  0.018  0.002  0.000  0.000  0.000  0.000  0.000  0.000  0.7172

[074] Table 2 shows entropy values for the first row of k-blocks of different sizes. Entropy falls dramatically on the extremities, but only moderately in the middle, as Table 3 shows. Table 4 can be used as a lookup table to determine the required threshold (t) for k_t in Figures 2 and 3.
Table 3: Entropy of middle row (largest integer smaller than or equal to k/2) by k-block size

Probability distribution for middle row of k-block

k    A      B      C      D      E      F      G      H      I      J      Entropy
5    0.110  0.165  0.180  0.168  0.140  0.105  0.070  0.040  0.018  0.005  2.9907
10   0.022  0.070  0.129  0.175  0.191  0.172  0.127  0.075  0.032  0.008  2.9735
15   0.019  0.069  0.138  0.193  0.207  0.175  0.117  0.059  0.021  0.004  2.8839
20   0.009  0.044  0.107  0.176  0.214  0.200  0.143  0.075  0.027  0.005  2.8453
25   0.009  0.047  0.115  0.188  0.223  0.198  0.132  0.064  0.020  0.003  2.8072
30   0.006  0.035  0.097  0.174  0.224  0.213  0.149  0.075  0.024  0.004  2.7863
35   0.007  0.037  0.104  0.184  0.230  0.210  0.140  0.066  0.020  0.003  2.7654
40   0.005  0.030  0.091  0.173  0.229  0.220  0.153  0.074  0.023  0.003  2.7526
45   0.005  0.033  0.097  0.181  0.234  0.217  0.145  0.067  0.019  0.003  2.7395
50   0.004  0.028  0.087  0.171  0.232  0.224  0.155  0.073  0.022  0.003  2.7309

Table 4: Entropy averages by block size k

k    Entropy    k    Entropy    k    Entropy
2    3.1036     15   2.5337     60   2.2754
3    3.0304     20   2.4611     65   2.2668
4    2.9504     25   2.4119     70   2.2593
5    2.8797     30   2.3764     75   2.2527
6    2.8194     35   2.3494     80   2.2469
7    2.7679     40   2.3283     85   2.2417
8    2.7238     45   2.3113     90   2.2371
9    2.6856     50   2.2972     95   2.2328
10   2.6522     55   2.2854

Compression Ratios for Randomly Generated Words Sorted Alphabetically

[075] Table 5 shows some compression ratios for an alphabet size of 10 with a randomly generated data set (each letter between a and j had equal probability of appearing before sorting). The original sizes reflect the standard use of 8 bits per character, and not the 4 with which it is possible to encode ten numbers, so the data should be adjusted accordingly. The data which furnished the result for width of column = 3, and number of rows = 100, is the following:

[076] aac, adh, afa, age, alb, ajj, bbe, bbi, bda, bdi, bef, beh, bib, bjf, ccd, cch, cch, cfd, cha, cjb, cjd, cjh, dag, dai, dcg, dcg, dci, dcj, dcj, dcla, deb, dcj, dfb, dfi, dhc, dhh, die, dii, dii, eci, ecih, efa, efg, ega, egj, eie, fbj, fcb, fce, fea, feg, fff, fgd, fid, Of, gac, gba, gbd, gbh, gca, gcj, gea, geh, gfe, gff, ggd, ghb, gib, gjj, hba, hbc, hbh, hcg, hee, hfe, hfe, hff, hia, hic, iad, iaf, ibb, ibi, iea, ieb, ief, ieg, ifa, igh, ihc, ihg, ihj, iig, jdb, jdd, jgc, jgg, jgh, jjd, jjf

Table 5: Compression ratios by columns and rows

Width of column    Number of rows    Compressed size / Original size
2                  100               0.137500
2                  1000              0.035937
2                  10000             0.010238
3                  100               0.216667
3                  1000              0.095583
3                  10000             0.026446
4                  100               0.270000
4                  1000              0.167906
4                  10000             0.073659
5                  100               0.303750
5                  1000              0.220300
5                  10000             0.136255
6                  100               0.323958
6                  1000              0.255063
6                  10000             0.184400
7                  100               0.339643
7                  1000              0.280393
7                  10000             0.219898
8                  100               0.351562
8                  1000              0.299781
8                  10000             0.246270

Applications

[077] The applicability of the model is very wide, but in its current form it is likely to benefit more from scenarios in which compression ratio is prioritised over performance. These include: Bank statements or other data where a date column constitutes one of a small number of columns; dates placed in ISO format offer a fixed width alphabetisation with n = 10 and offer very favourable entropies. Dictionaries in computer programs (including other compression mechanisms), which are used in nearly all modern programming languages, are normally alphabetised. Key-value data such as registries are present in all sorts of devices from microcontrollers to full-fledged operating systems. Indexes and primary-foreign key tables in relational database systems can get extremely large and often require long-term storage.
Indeed, long-term storage of databases with tables that have multiple columns that are likely to be repeated are prime candidates, because they can offer very large k-block sizes that are known in advance. More commercially, phone books or other similar alphabetical listings with a large number of entries, such as catalogues or databases, especially for longer term storage, can all benefit from the method proposed here.
[078] The invention has been described with reference to certain preferred embodiments. The invention is not limited to these embodiments. The scope of the invention is limited to the claims.
Bibliography

Marius Iosifescu, Finite Markov Processes and Their Applications. Chichester: Wiley, 1980.
David Salomon, Data Compression: The Complete Reference, 2nd Edition. New York: Springer, 2000.
Khalid Sayood, Introduction to Data Compression, 5th Edition. New York: Morgan Kaufmann, 2017.
Vladimir Semenyuk and Serge Volkoff, US Patent US8537038B1, Efficient compression method for sorted data representations, 2010.
Albert N. Shiryaev, Probability, Vol. 1, Third Edition. London: Springer, 2016.

Claims (16)

  1. A method of lossless compression of sorted data, comprising: compressing, at one or more processors, a sorted list of strings formed of characters from a finite alphabet, each string forming a row of a plurality of rows of a matrix configuration, and each consecutive character occupying a respective consecutive column of a plurality of columns of the matrix configuration; moving, by the one or more processors, down each column in a sequence; and for each character identified as occupying a block of characters having identical rows in previous columns relative to traversing a predetermined column of the plurality of columns, establishing, by the one or more processors, a probability distribution for an identity of a next character of the characters during a traversal and a screening of the characters.
  2. The method according to claim 1, wherein the compressing of the sorted list is configured by a context-based arithmetic coding that achieves a compression ratio by a highly skewed probability distribution model.
  3. The method according to claim 1, further comprising determining a suitable column number for the compression method that allows an implementer of the method to optimise all compression ratios.
  4. The method according to claim 1, further comprising compressing a first few columns of the matrix configuration and delaying compression of remaining columns.
  5. The method according to any of the preceding claims, wherein the compressing comprises determining whether a better compression ratio can be attained.
  6. The method according to any of the preceding claims, further comprising defining at least two thresholds: one designating a minimum acceptable size of a number of blocks to be compressed; and one designating a number of blocks that is preferred over a minimum acceptable size to be compressed.
  7. The method according to claim 1, further comprising, during an encoding process, compressing words of variable widths.
  8. The method according to claim 1, wherein the sorted list comprises elements of a standard computer system keyboard and keyboard applications on a graphical user interface.
  9. The method according to claim 8, wherein the sorted list further comprises a dictionary and a phonebook, both using a natural language alphabet, or a type of tables in a database using an alphabet that comprises entries selected from a finite number of possibilities.
  10. The method according to claim 8 or 9, further comprising padding each name with spaces so that all names have a same width as a longest name.
  11. The method of claim 9 or 10, wherein a resulting width and a number of rows are noted, followed by flattening a list to be compressed by moving consecutively across the columns of the matrix configuration.
  12. The method according to claim 10 or 11, further comprising encoding using arithmetic coding with a set of probabilities for each character depending on characters in a vicinity of a first predetermined character to be compressed.
  13. The method according to claim 11 or 12, wherein the probabilities are determined and adjusted for arithmetic coding.
  14. The method according to claim 12 or 13, further comprising adding a space or a padding character as a preceding character or a succeeding character.
  15. A system of lossless compression of sorted data, the system comprising one or more processors that further comprise at least one memory, at least one decoder and at least one encoder, the memory receiving an input for compression through the decoder and the encoder, the system configured to perform operations including: compressing, using the input for compression, a sorted list of strings formed of characters from a finite alphabet, each string forming a row of a plurality of rows of a matrix configuration, and each consecutive character occupying a respective consecutive column of a plurality of columns of the matrix configuration; moving, using the input for compression, down each column in a sequence; and for each character identified as occupying a block of characters having identical rows in previous columns relative to traversing a predetermined column of the plurality of columns, establishing, using the input for compression, a probability distribution for an identity of a next character of the characters during a traversal and a screening of the characters.
  16. A non-transitory computer readable medium having stored thereon software instructions that, when executed by a processor, cause the processor to perform operations including: compressing a sorted list of strings formed of characters from a finite alphabet, each string forming a row of a plurality of rows of a matrix configuration, and each consecutive character occupying a respective consecutive column of a plurality of columns of the matrix configuration; moving down each column in a sequence; and for each character identified as occupying a block of characters having identical rows in previous columns relative to traversing a predetermined column of the plurality of columns, establishing a probability distribution for an identity of a next character of the characters during a traversal and a screening of the characters.
GB2007278.1A 2020-05-16 2020-05-16 Lossless compression of sorted data Active GB2595002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2007278.1A GB2595002B (en) 2020-05-16 2020-05-16 Lossless compression of sorted data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2007278.1A GB2595002B (en) 2020-05-16 2020-05-16 Lossless compression of sorted data

Publications (3)

Publication Number Publication Date
GB202007278D0 GB202007278D0 (en) 2020-07-01
GB2595002A true GB2595002A (en) 2021-11-17
GB2595002B GB2595002B (en) 2022-06-15

Family

ID=71135127

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2007278.1A Active GB2595002B (en) 2020-05-16 2020-05-16 Lossless compression of sorted data

Country Status (1)

Country Link
GB (1) GB2595002B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114500670B (en) * 2022-02-28 2024-04-05 北京京东振世信息技术有限公司 Encoding compression method, decoding method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5587710A (en) * 1995-03-24 1996-12-24 National Semiconductor Corporation Syntax based arithmetic coder and decoder
US6075471A (en) * 1997-03-14 2000-06-13 Mitsubishi Denki Kabushiki Kaisha Adaptive coding method
US20090079602A1 (en) * 2007-09-19 2009-03-26 Vivienne Sze N-BIN Arithmetic Coding for Context Adaptive Binary Arithmetic Coding

Also Published As

Publication number Publication date
GB202007278D0 (en) 2020-07-01
GB2595002B (en) 2022-06-15
