KR20160100496A

KR20160100496A - Improved huffman code method and apprartus thereof by using binary clusters

Info

Publication number: KR20160100496A
Application number: KR1020150022947A
Authority: KR
Inventors: 김정훈
Original assignee: 김정훈
Priority date: 2015-02-15
Filing date: 2015-02-15
Publication date: 2016-08-24

Abstract

Provided is a data compressing method using Huffman coding as a method for compressing data regarding arbitrary binary numeral data. When the Huffman coding is performed, the present invention can perform more effective compression by improving a method for performing compression by calculating a Huffman sign from appearance frequency of a fixed length symbol after dividing data with an existing fixed length, and then performing the compression by calculating the Huffman coding by dividing the data with a variable length regarding arbitrary binary data. The Huffman coding produces a variable length code from the fixed length symbol. An LZW compressing method is a compressing method to produce fixed length code from a variable length symbol. The present invention is a design regarding the compressing method to produce the fixed length code from the variable length symbol by improving the Huffman coding.

Description

[0001] DESCRIPTION [0002] IMPROVED HUFFMAN CODE METHOD AND APPARATUS THEREOF BY USING BINARY CLUSTERS [0003]

Data compression using Huffman coding in information theory

Huffman coding method

Detailed description of specific embodiments for carrying out the invention

The compression efficiency of the Huffman code can be further improved by using the binary cluster.

For example, as one embodiment, there is a data compression method using Huffman coding as a method of compressing data for arbitrary binary number data. The present invention improves a method of dividing data into a fixed length by dividing data into a fixed length and calculating a Huffman code from the frequency of appearance of the fixed length symbol to perform compression, , And Huffman coding is calculated and compression is performed, so that more effective compression can be performed.

Huffman coding generates a variable length code from a fixed length symbol and the LZW compression method is a compression method that generates a fixed length code from a variable length symbol.

The present invention is a scheme for a compression method for improving Huffman coding to generate a variable length code from a variable length symbol.

First, any binary data

0101000001001011000000000000000000100000000000000000000000000000000000000000100001000000001101010011110010001101000101101110110100000001000000000000000001111000000011000000000000000 ....

A binary number "10" of 2 bits is added before the most significant bit. By doing this, arbitrary binary data can be changed to start with "10 ", and the added" 10 "is referred to as a compulsory header (hereinafter referred to as CH) in the present invention. This CH may vary, since it is necessary to start any binary number with a "10 ", so that if CH starts with" 10 " and is a binary number consisting of zero or more consecutive "0" Everything is possible.

100 "," 100 "," 100 "," 101 "," 1001 "," 1011 "," 1000001 ", and so on. However, in decryption, what CH You should be aware of the information. In this embodiment, since the CH is short, an example in which "10", which is a binary number starting from the smallest "10", is added to CH before the most significant bit of an arbitrary binary number.

If there is data of 876,040 bits of arbitrary binary data, "10" is added to the most significant bit, and Huffman compression is performed according to the present invention for 876,042 bits of binary data to be compressed.

Step 1: For binary data to be compressed first, binary data is separated each time it encounters "10 " in the direction from the most significant bit to the least significant bit, or vice versa.

The binary data is separated at the point where the change of the bit value occurs. Let's call it a binary cluster.

For example, if you look at the binary number "101110001110", you can separate the binary data at the points where you first encounter "10", as shown below.

1011/100011/10

In the same manner as described above, 876,042 bits of binary data to be compressed

100101000001001011000000000000000001000000000000000000000000000000000000000001000010000000011010100111100100011010001011011101101000000010000000000000000011110000000110000000000000 ....

To "10" for the first time. Let's call each of these binary chunks a binary cluster. For the sake of brevity, binary clusters generated in this way can be expressed as binary numbers beginning with binary "10" followed by zero or more binary "0" followed by zero or more binary "1".

100/10/100000/100/101/10000001/100000/100000/10/1000000000000000/10000000000000/100000000000000000000000000000/10000/100000000001/10/10/100111/100/1000 / 10000000/100000000000000000111/100000001/1000000000000000 ....

Step 2: Next, as shown in Table 1 below, the result of dividing a total of 876,042 bits into binary cluster units is sequentially shown. Only 30 binary clusters were shown sequentially.

Binary cluster 100 10 100000 100 101 10000001 100000 100000 10 10000000000000001 10000000000000 100000000000000000000000000000 10000 100000000001 10 10 100111 100 10001 10 1000 101 1011 101 10 10000000 100000000000000000111 100000001 1000000000000000000000 1001 ...

In the case of the binary data of this embodiment, it can be seen that it is divided into 207,982 binary clusters.

Table 2 shows the binary clusters in Table 1, and the frequency of appearance of each cluster is shown in Table 2. Some contents are omitted because of the ground relationship.

Binary cluster symbol Appearance frequency 10 53632 100 24900 1000 12406 10000 6136 100000 3614 1000000 1380 10000000 1682 100000000 417 1 billion 1406 10000000000 87 100000000000 174 1000000000000 33 10000000000000 82 100000000000000 12 1000000000000000 4 10000000000000000 7 100000000000000000 11 1000000000000000000 46 10000000000000000000 17 100000000000000000000 13 1000000000000000000000 16 10000000000000000000000 4 100000000000000000000000 One 1000000000000000000000000 3 10000000000000000000000000 One 100000000000000000000000000 2 ... . ... . 1001 12260 10011 6048 100111 3387 1001111 1631 10011111 771 100111111 350 1001111111 198 10011111111 107 100111111111 70 1001111111111 49 10011111111111 12 100111111111111 11 1001111111111111 9 10011111111111111 3 100111111111111111 3 1001111111111111111 2 101 25972 1011 12825 10111 6687 101111 3351 1011111 1565 10111111 818 101111111 366 1011111111 239 10111111111 136 101111111111 117 1011111111111 48 10111111111111 20 101111111111111 16 1011111111111111 2 10111111111111111 5 101111111111111111 2 1011111111111111111 One

Huffman coding is performed using the binary distribution table of Table 2 above.

Huffman coding is a kind of entropy coding used in lossless compression. It is an algorithm that produces a prefix code (a code in which a symbol of one symbol does not become a prefix of another symbol code) from the frequency of a symbol. Allocate longer code, and allocate shorter codes for the more frequent symbols. In the present invention, an algorithm for generating a variable length code in a "variable length" symbol is developed and utilized in the feature of this Huffman hatching .

There are many ways to design Huffman codes. One example is a method of generating Huffman codes using a binary tree. In the binary tree structure, an external node or a leaf node represents a corresponding symbol. The Huffman code for an arbitrary symbol starts from a root node and descends to a leaf node corresponding to a symbol, and generates a code word assigned to each branch by mode combining.

In the figure, in the case of the A symbol, "0", which sequentially applies the bits assigned to the arm, which is the route descending from the root node through the inner node, is the Huffman code,

10 "for symbol B," 110 "for symbol C, and" 111 "for symbol D, respectively.

The Huffman code of the minimum mutation among the methods of constructing the Huffman tree is described as an example. When the two symbols having the lowest probability or frequency are combined, the sum of the probability of occurrence of two symbols or the sum of the frequencies is The code is re-sorted by comparing with the probability of the remaining symbols, and then the code generation process is repeated.

The minimum variation Huffman code for the set S = {a, b, c, d, e} consisting of five symbols is 2 times, 4 times , 2 times, 1 time, 1 time. Of course, the probability of occurrence other than the number may be used to calculate the same calculation procedure.

In the present invention, binary clusters in which the set of symbols a, b, c, d, e, etc. appear immediately become symbols.

The minimal transition Huffman code start combines two symbols d and e with the lowest probability of occurrence or frequency of occurrence. And assigns different bits to the last bit of the codeword corresponding to the two symbols. At this time, "1" is assigned to the side where the probability of occurrence or frequency of appearance is smaller than "0" to the side of which the probability of occurrence or frequency of occurrence is higher, and when the probability or appearance frequency is the same, can do. In this embodiment, when symbols are sorted in ascending order, "0"

Next, these two symbols (d, e) are combined to create a new symbol n1. The frequency of occurrence of the new symbol n1 is equal to the sum of the occurrences of the two symbols d and e. Therefore, the frequency of n1 is 2. If there is an existing symbol having the same probability or appearance frequency as the probability of the newly generated symbol upon reordering the symbols according to the occurrence frequencies of the four symbols a, b, n1, and c, Place the symbol above the existing symbol with the same probability or frequency of occurrence. This differs from regular huffman code and minimal variation huffman code.

The above process is shown below.

This process is summarized as follows. The sum of the signs of the paths down to each leaf node is the Huffman code for each symbol.

Table 14 shows the result of generating the Huffman code by applying the actual binary cluster as a symbol to the above example.

Binary cluster symbol Appearance frequency Huffman code 10 2 10 10000 4 00 1001 2 11 100 One 010 1000111 One 011

The above Huffman coding calculation method is a result using a minimum variation tree. It is a matter of course that it is possible to perform coding and decoding using a very wide variety of known Huffman tree generation methods such as general Huffman tree and adaptive Huffman coding. In one embodiment, the Huffman coding method using a minimum variation tree is used.

Table 2 below shows Huffman coding for each binary cluster in Table 2 based on symbols and frequency. In this case, the Huffman coding scheme generates a minimum mutation tree and constructs a code tree and generates Huffman codes using the tree. Only a part of it is represented by the ground relation.

Of course, Huffman codes can be generated in a wide variety of ways such as maximum transition tree and adaptive Huffman coding, but there may be some differences in the compression ratio.

Binary cluster symbol Appearance frequency Huffman coding result 10 53632 01 100 24900 101 1000 12406 1100 10000 6136 11100 100000 3614 000001 1000000 1380 00000000 10000000 1682 0001110 100000000 417 000010110 1 billion 1406 1111111 ... . ... .. ... .. 1001 12260 1101 10011 6048 11110 100111 3387 000011 1001111 1631 0001111 10011111 771 10011101 100111111 350 111111011 1001111111 198 1001110000 10011111111 107 00001010001 100111111111 70 000010000000 1001111111111 49 000010111110 10011111111111 12 00001011111111 100111111111111 11 11111101010111 1001111111111111 9 000000111100111 10011111111111111 3 1001111100010111 100111111111111111 3 0000101000001011 1001111111111111111 2 00001000010001001 101 25972 001 1011 12825 1000 10111 6687 00010 101111 3351 000110 1011111 1565 1001011 10111111 818 10010100 101111111 366 111111000 1011111111 239 0000100010 10111111111 136 00001000001 101111111111 117 00001000111 1011111111111 48 100111110000 10111111111111 20 00000011010110 101111111111111 16 00001000011011 1011111111111111 2 00001000010001010 10111111111111111 5 0000001101011111 101111111111111111 2 00001000010001011 1011111111111111111 One 000010000111101101

Now, with the binary cluster-Huffman code conversion table in Table 3, compressed data is generated by replacing each binary cluster as shown below with a Huffman code.

The results are shown in Table 4 below. Moving from the top to the bottom, the final compressed data is generated while replacing each binary cluster with the code conversion table of Table 3. The actual compressed data is physically stored in a binary data form in which Huffman codes shown in Table 4 are successively continued in 1: 1 corresponding to the order of appearance of binary clusters.

Binary cluster symbol Corresponding Huffman code 100 101 10 01 100000 000001 100 101 101 001 10000001 000000010 100000 000001 100000 000001 10 01 10000000000000001 000000111101 10000000000000 000000110100 100000000000000000000000000000 000000111110 10000 11100 100000000001 100111110010 10 01 10 01 100111 000011 100 101 10001 11101 10 01 1000 1100 101 001 1011 1000 101 001 10 01 10000000 0001110 100000000000000000111 00001000010011010 100000001 0000001100 1000000000000000000000 00001000011000 1001 1101 ... ...

Compressed data can be generated while converting Huffman codes to binary clusters (symbols) at 1: 1 according to the binary cluster-Huffman code table of Table 3 as shown in Table 4,

Original data consisting of binary cluster symbols,

100/10/100000/100/101/10000001/100000/100000/10/1000000000000000/10000000000000/100000000000000000000000000000/10000/100000000001/10/10/100111/100/1000 / 10000000/100000000000000000111/100000001/100000000000 0000 .... The compressed data of the following form is as follows. In the following, "/" is a conceptual separation of each binary cluster symbol to represent the corresponding Huffman code, and is actually a nonexistent character.

101/01/000001/101/001/000000010/000001/000001/01/000000111101/000000110100/000000111110/11100/100111110010/01/01/000011/101/11101/01/1100/001/1000/001/01

/ 0001110/00001000010011010/0000001100/00001000011000/1101 / ..

The compression result can be calculated in advance as shown in Table 5. Some parts are omitted for the sake of the paper.

The difference between the length of the binary cluster symbol corresponding to each binary cluster symbol and the length of the Huffman code corresponding to each binary cluster symbol is a compression gain per symbol (in this specification, - (minus) is expressed as a gain), and multiplying the compression gain by the appearance frequency of the binary cluster symbol The compression gain sum is calculated for all the binary cluster symbols equally, and then the result is the final sum compression gain.

Binary cluster Huffman code Compression gain per symbol Appearance frequency Original data size Sum gain per symbol Compressed data size 10 01 0 53632 107264 0 107264 100 101 0 24900 74700 0 74700 1000 1100 0 12406 49624 0 49624 10000 11100 0 6136 30680 0 30680 100000 000001 0 3614 21684 0 21684 1000000 00000000 One 1380 9660 1380 11040 10000000 0001110 -One 1682 13456 -1682 11774 100000000 000010110 0 417 3753 0 3753 1 billion 1111111 -3 1406 14060 -4218 9842 10000000000 000000011010 One 87 957 87 1044 100000000000 00000001100 -One 174 2088 -174 1914 1000000000000 0000100000011 0 33 429 0 429 10000000000000 000000110100 -2 82 1148 -164 984 100000000000000 00001011111110 -One 12 180 -12 168 1000000000000000 0000101000001010 0 4 64 0 64 10000000000000000 000010111100010 -2 7 119 -14 105 100000000000000000 000000011101110 -3 11 198 -33 165 1000000000000000000 100111110011 -7 46 874 -322 552 10000000000000000000 00001000000100 -6 17 340 -102 238 100000000000000000000 00001011110110 -7 13 273 -91 182 1000000000000000000000 00001000011000 -8 16 352 -128 224 ... ... ... ... ... ... ... 1001 1101 0 12260 49040 0 49040 10011 11110 0 6048 30240 0 30240 100111 000011 0 3387 20322 0 20322 1001111 0001111 0 1631 11417 0 11417 10011111 10011101 0 771 6168 0 6168 100111111 111111011 0 350 3150 0 3150 1001111111 1001110000 0 198 1980 0 1980 10011111111 00001010001 0 107 1177 0 1177 100111111111 000010000000 0 70 840 0 840 1001111111111 000010111110 -One 49 637 -49 588 10011111111111 00001011111111 0 12 168 0 168 100111111111111 11111101010111 -One 11 165 -11 154 1001111111111111 000000111100111 -One 9 144 -9 135 10011111111111111 1001111100010111 -One 3 51 -3 48 100111111111111111 0000101000001011 -2 3 54 -6 48 1001111111111111111 00001000010001001 -2 2 38 -4 34 101 001 0 25972 77916 0 77916 1011 1000 0 12825 51300 0 51300 10111 00010 0 6687 33435 0 33435 101111 000110 0 3351 20106 0 20106 1011111 1001011 0 1565 10955 0 10955 10111111 10010100 0 818 6544 0 6544 101111111 111111000 0 366 3294 0 3294 1011111111 0000100010 0 239 2390 0 2390 10111111111 00001000001 0 136 1496 0 1496 101111111111 00001000111 -One 117 1404 -117 1287 1011111111111 100111110000 -One 48 624 -48 576 10111111111111 00000011010110 0 20 280 0 280 101111111111111 00001000011011 -One 16 240 -16 224 1011111111111111 00001000010001010 One 2 32 2 34 10111111111111111 0000001101011111 -One 5 85 -5 80 101111111111111111 00001000010001011 -One 2 36 -2 34 1011111111111111111 000010000111101101 -One One 19 -One 18 Sum 876042 -34466 841576

As shown in Table 5, the binary data to be compressed includes 876,042 bits including CH, and it can be seen that a compression gain of 34,466 bits occurs at 841,576 bits after compression.

In addition to the compression result data, Huffman coding is accompanied by a code dictionary for decompression, which is the same as the code dictionary in the Huffman coding method. That is, in the code prefix part, the dictionary information in which the corresponding Huffman codes are mapped for each binary cluster symbol can be added by borrowing a pre-information addition form of the known Huffman coding method. Further, in the present invention, Respectively.

A) First, in order to construct a coding dictionary, the bit length information of consecutive "0" after "10" of the binary cluster and the bit length information of consecutive "1" and the appearance frequency information are stored , It is possible to know exactly the value of the binary cluster from the bit length of consecutive "0" s after "10" of the binary cluster and the bit length information of consecutive "1", and the frequency of occurrence of the symbol It is possible to accurately reproduce the Huffman code for each binary cluster symbol by applying the Huffman tree generation algorithm in the same manner as the Huffman tree generation method used at the time of compression.

That is, when the binary cluster-occurrence frequency information is stored together with the dictionary information for decompression in the Huffman coding as shown in Table 6 below, the decompression is performed using the same Huffman tree implementation method as in compression The binary cluster-specific Huffman code can also be obtained at the decompression stage.

Binary cluster Huffman code 10 01 100 101 1000 1100 10000 11100 100000 000001 1000000 00000000 10000000 0001110 100000000 000010110 1 billion 1111111 10000000000 000000011010 100000000000 00000001100 1000000000000 0000100000011 10000000000000 000000110100 100000000000000 00001011111110 1000000000000000 0000101000001010 10000000000000000 000010111100010 100000000000000000 000000011101110 1000000000000000000 100111110011 10000000000000000000 00001000000100 100000000000000000000 00001011110110 1000000000000000000000 00001000011000 ... ... 1001 1101 10011 11110 100111 000011 1001111 0001111 10011111 10011101 100111111 111111011 1001111111 1001110000 10011111111 00001010001 100111111111 000010000000 1001111111111 000010111110 10011111111111 00001011111111 100111111111111 11111101010111 1001111111111111 000000111100111 10011111111111111 1001111100010111 100111111111111111 0000101000001011 1001111111111111111 00001000010001001 101 001 1011 1000 10111 00010 101111 000110 1011111 1001011 10111111 10010100 101111111 111111000 1011111111 0000100010 10111111111 00001000001 101111111111 00001000111 1011111111111 100111110000 10111111111111 00000011010110 101111111111111 00001000011011 1011111111111111 00001000010001010 10111111111111111 0000001101011111 101111111111111111 00001000010001011 1011111111111111111 000010000111101101

In particular, Table 6 can be expressed more compactly as shown in Table 13. The binary cluster symbols always start with "10 " due to their characteristics, and the binary lengths are 10, 100, 1000, 10000, 1011 "," 101111 ", ..., and so on, there are binary clusters that can be expressed simply as" 10 "followed by" 0 " There are binary clusters that can be expressed simply by the number of consecutive "1 " s after" 10 ". In the case of binary clusters composed of one or more consecutive "0s" followed by "10s" and one or more consecutive "1s" such as "1000111", "10000001", "100011", ..., Quot; 10 "and the number of consecutive" 0 "s and the number of consecutive" 1 " s. For example, in the case of "10 ", since the number of consecutive" 0 "s after" 10 " is 0, . In the case of "101 ", 0 is consecutively followed by " 10 ", and the consecutive" 1 "

The number of consecutive zeros The number of consecutive 1's Appearance frequency 0 0 53632 One One 24900 2 2 12406 3 3 6136 4 4 3614 5 5 1380 6 6 1682 7 7 417 8 8 1406 9 9 87 10 10 174 11 11 33 12 12 82 13 13 12 14 14 4 15 15 7 16 16 11 17 17 46 18 18 17 19 19 13 20 20 16 .. ... One One 12260 2 2 6048 3 3 3387 4 4 1631 5 5 771 6 6 350 7 7 198 8 8 107 9 9 70 10 10 49 11 11 12 12 12 11 13 13 9 14 14 3 15 15 3 16 16 2 0 One 25972 One 2 12825 2 3 6687 3 4 3351 4 5 1565 5 6 818 6 7 366 7 8 239 8 9 136 9 10 117 10 11 48 11 12 20 12 13 16 13 14 2 14 15 5 15 16 2 16 17 One

As described above, the encoding dictionary can be more effectively represented by a method that can express each binary cluster symbol with ease,

Also, in the case of the appearance frequency, the occurrence frequency for the appearance frequency values of the second column of Table 6 is separately calculated as shown in Table 7. Then, the appearance frequency value symbols are subjected to Huffman coding based on Table 7, Thus, it is possible to compress and express the information on the appearance frequency of each symbol in the dictionary information.

Binary cluster symbol Appearance frequency 10 53632 100 24900 1000 12406 10000 6136 100000 3614 1000000 1380 10000000 1682 100000000 417 1 billion 1406 ... . ... .. 1001 12260 10011 6048 100111 3387 1001111 1631 10011111 771 100111111 350 1001111111 198 10011111111 107 100111111111 70 1001111111111 49 10011111111111 12 100111111111111 11 1001111111111111 9 10011111111111111 3 100111111111111111 3 1001111111111111111 2 101 25972 1011 12825 10111 6687 101111 3351 1011111 1565 10111111 818 101111111 366 1011111111 239 10111111111 136 101111111111 117 1011111111111 48 10111111111111 20 101111111111111 16 1011111111111111 2 10111111111111111 5 101111111111111111 2 1011111111111111111 One

Appearance frequency value Appearance frequency One 44 2 28 3 21 4 12 5 4 6 5 7 4 8 2 9 4 10 One 11 5 12 2 13 3 14 One 16 4 17 3 19 One 20 3 21 One 22 3 23 One 25 One 28 2 29 One 32 One 33 One 35 One 39 One 43 2 44 2 46 One 48 2 49 One 58 One 64 One 70 One 72 2 82 2 84 One 85 One 86 One 87 One 89 One 97 2 107 One 117 One 136 One 161 One 174 One 177 One 185 One 186 One 198 One 212 One 213 One 239 One 313 One 325 One 350 One 366 One 372 One 386 One 417 One 432 One 688 One 755 One 771 One 803 One 818 One 867 One 1343 One 1380 One 1406 One 1475 One 1484 One 1565 One 1631 One 1682 One 3104 One 3277 One 3351 One 3387 One 3614 One 6048 One 6123 One 6136 One 6687 One 12260 One 12406 One 12825 One 24900 One 25972 One 53632 One

B) In the next method of constructing a dictionary for decompression,

As shown in Table 8 below, the binary cluster, the Huffman code, and the length information of the Huffman code are stored. A) method differs from Huffman code in that it does not reproduce the Huffman code again in decoding but stores the Huffman code directly in the encoding dictionary. The symbol length of the binary cluster may be expressed more effectively as shown in Table 13 of the method A, and the Huffman codes of Table 8 may be stored consecutively in the encoding dictionary. The Huffman code length, which can be divided into the respective symbol Huffman codes, The binary cluster symbols corresponding to each of the Huffman codes can be sequentially recovered from the dictionary information.

Binary cluster symbol Huffman code Length of Huffman code 10 01 2 100 101 3 1000 1100 4 10000 11100 5 100000 000001 6 1000000 00000000 8 10000000 0001110 7 100000000 000010110 9 1 billion 1111111 7 ... . ... .. ... 1001 1101 4 10011 11110 5 100111 000011 6 1001111 0001111 7 10011111 10011101 8 100111111 111111011 9 1001111111 1001110000 10 10011111111 00001010001 11 100111111111 000010000000 12 1001111111111 000010111110 12 10011111111111 00001011111111 14 100111111111111 11111101010111 14 1001111111111111 000000111100111 15 10011111111111111 1001111100010111 16 100111111111111111 0000101000001011 16 1001111111111111111 00001000010001001 17 101 001 3 1011 1000 4 10111 00010 5 101111 000110 6 1011111 1001011 7 10111111 10010100 8 101111111 111111000 9 1011111111 0000100010 10 10111111111 00001000001 11 101111111111 00001000111 11 1011111111111 100111110000 12 10111111111111 00000011010110 14 101111111111111 00001000011011 14 1011111111111111 00001000010001010 17 10111111111111111 0000001101011111 16 101111111111111111 00001000010001011 17 1011111111111111111 000010000111101101 18

We will now briefly explain the decompression process.

The decompression process is performed from the encoded dictionary and the compressed data and the CH binary number for decompression configured as described above.

First, binary cluster symbols and Huffman code dictionaries as shown in Table 6 or Table 8 are reproduced from the decompression dictionary in the form shown in Table 9. [

Binary cluster symbol Huffman code 10 01 100 101 1000 1100 10000 11100 100000 000001 1000000 00000000 10000000 0001110 100000000 000010110 1 billion 1111111 ... . ... .. 1001 1101 10011 11110 100111 000011 1001111 0001111 10011111 10011101 100111111 111111011 1001111111 1001110000 10011111111 00001010001 100111111111 000010000000 1001111111111 000010111110 10011111111111 00001011111111 100111111111111 11111101010111 1001111111111111 000000111100111 10011111111111111 1001111100010111 100111111111111111 0000101000001011 1001111111111111111 00001000010001001 101 001 1011 1000 10111 00010 101111 000110 1011111 1001011 10111111 10010100 101111111 111111000 1011111111 0000100010 10111111111 00001000001 101111111111 00001000111 1011111111111 100111110000 10111111111111 00000011010110 101111111111111 00001000011011 1011111111111111 00001000010001010 10111111111111111 0000001101011111 101111111111111111 00001000010001011 1011111111111111111 000010000111101101

Using the dictionary information shown in Table 9 above, in the following compressed data composed of continuous Huffman codes,

/ 0001110/00001000010011010/0000001100/00001000011000/1101 / ..

And sequentially recover the respective binary cluster symbols corresponding to the Huffman codes. The restoration results are shown in Table 1, and physically it will look like this.

Now stripping the conceptually separated "/" and looking at one stream, it looks like this:

Finally, since CH information "10" added at the time of compression is known, if CH is removed from the most significant bit, it is decompressed to any binary number before compression of 876,040 bits as shown below.

In order to compare the effects of the present invention, the same original binary data is divided into fixed length symbols to produce fixed length symbols, and the results are shown in the following Huffman coding scheme according to the appearance frequency of the symbols. Fixed length is one of the most commonly used 8 bits, and it can be compressed in various ways as well as fixed length.

Table 10 shows the results of dividing the same data into 8-bit fixed lengths.

8 bit fixed length symbols (decimal conversion) 80 75 3 4 20 0 6 0 8 0 0 0 33 0 53 60 141 22 237 One 0 0 120 12 0 0 19 0 8 2 ... .

Table 11 shows the results of Huffman coding according to the occurrence frequencies and appearance frequencies of the 8-bit fixed-length symbols. The Huffman encoding algorithm at this time is the Huffman encoding result using the minimum mutation TREE that is the same as the Huffman encoding algorithm applied to the present invention.

Fixed length symbols (8 bits) Appearance frequency Huffman code Original data size Compressed data size 0 5348 00000 42784 26740 One 760 1011000 6080 5320 2 625 00001111 5000 5000 3 396 10010100 3168 3168 4 891 0011111 7128 6237 5 337 11111001 2696 2696 6 398 10010000 3184 3184 7 332 000010000 2656 2988 8 872 0100110 6976 6104 9 362 11010101 2896 2896 10 354 11100010 2832 2832 11 349 11101010 2792 2792 12 319 000010111 2552 2871 13 332 11111110 2656 2656 14 342 11110110 2736 2736 15 402 10000000 3216 3216 16 882 0100001 7056 6174 17 387 10101000 3096 3096 18 345 11101110 2760 2760 19 353 11100100 2824 2824 20 416 01100101 3328 3328 21 371 11000100 2968 2968 22 383 10101100 3064 3064 23 363 11010100 2904 2904 24 298 000101100 2384 2682 25 375 10111100 3000 3000 26 409 01110001 3272 3272 27 376 10111000 3008 3008 28 364 11010000 2912 2912 29 377 10110110 3016 3016 30 369 11001010 2952 2952 ... ... ... ... ... 230 401 10000101 3208 3208 231 334 11111101 2672 2672 232 428 01010111 3424 3424 233 427 01011001 3416 3416 234 412 01101101 3296 3296 235 399 10001100 3192 3192 236 365 11001110 2920 2920 237 414 01101000 3312 3312 238 388 10100011 3104 3104 239 387 10100101 3096 3096 240 414 01101001 3312 3312 241 477 00101001 3816 3816 242 344 11110001 2752 2752 243 369 11001001 2952 2952 244 476 00101010 3808 3808 245 399 10001101 3192 3192 246 430 01010101 3440 3440 247 405 01110110 3240 3240 248 498 00100100 3984 3984 249 407 01110011 3256 3256 250 391 10011011 3128 3128 251 438 01001001 3504 3504 252 405 01110111 3240 3240 253 413 01101011 3304 3304 254 439 01000111 3512 3512 255 619 00010000 4952 4952 total 109505 876040 863345

As can be seen in Table 11, it can be seen that 876,040 bits of binary data are compressed into 863,345 bits of binary data. The fact that the original data is 2 bits smaller does not require the addition of a "10" . On the other hand, it can be seen that the result is not better than the result of compressing to 841,576 bits of the present invention. Although we did not consider a dictionary for decompression, we can use the present invention to obtain a bit reduction effect of about 22,000 bits and this trend is expected to be better than the Huffman algorithm for all data types . Since the number of symbols in Table 11 is 256 and the number of symbols is a number from 0 to 255 from "00000000" to "11111111", the complexity is great, and the number of symbols using the mosaic cluster of the present invention is only 231, It is expected that the present invention will be more effective when it is calculated by including advance information for decompression because it is a binary number of a type that can be easily expressed.

Claims

A method of dividing binary data into binary clusters and a method of compressing data by generating Huffman codes using the distribution of separated binary clusters