KR20160100496A - Improved huffman code method and apprartus thereof by using binary clusters - Google Patents
- Publication number
- KR20160100496A (application KR1020150022947A)
- Authority
- KR
- South Korea
- Prior art keywords
- binary
- huffman
- symbol
- data
- code
- Prior art date
Classifications
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/40—Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
Data compression using Huffman coding in information theory
Huffman coding method
Detailed description of specific embodiments for carrying out the invention
The compression efficiency of Huffman coding can be further improved by using binary clusters.
For example, in one embodiment there is a data compression method that applies Huffman coding to arbitrary binary data. The conventional method divides the data into fixed-length symbols and computes a Huffman code from the appearance frequency of each fixed-length symbol. The present invention improves on this by dividing the data into variable-length symbols and computing the Huffman code from their appearance frequencies, so that more effective compression can be performed.
Huffman coding generates a variable-length code from fixed-length symbols, while the LZW compression method generates a fixed-length code from variable-length symbols.
The present invention is a scheme for a compression method that improves Huffman coding so as to generate a variable-length code from a variable-length symbol.
First, any binary data
0101000001001011000000000000000000100000000000000000000000000000000000000000100001000000001101010011110010001101000101101110110100000001000000000000000001111000000011000000000000000 ....
A 2-bit binary string "10" is prepended before the most significant bit. By doing this, arbitrary binary data is changed so that it starts with "10"; the prepended "10" is referred to in the present invention as a compulsory header (hereinafter CH).
This CH may vary. Since the only requirement is that the binary data start with "10", any binary string that starts with "10", for example "100", "101", "1001", "1011", "1000001", and so on, may serve as the CH. In decompression, however, the decoder must know which CH was used. In this embodiment, since a short CH is desirable, the smallest such string, "10", is prepended before the most significant bit of the arbitrary binary data.
Given 876,040 bits of arbitrary binary data, "10" is prepended before the most significant bit, and Huffman compression according to the present invention is performed on the resulting 876,042 bits of binary data to be compressed.
Step 1: Scanning the binary data to be compressed in the direction from the most significant bit to the least significant bit, the data is cut each time a "10" is encountered, that is, at each point where the bit value changes from "1" to "0". Each piece separated in this way is called a binary cluster.
For example, looking at the binary number "101110001110", cutting at each point where a "10" begins yields the separation shown below.
1011/100011/10
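The separation rule of Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name is my own. The CH guarantees the data begins with "10", so every resulting piece also begins with "10".

```python
def split_into_clusters(bits: str) -> list[str]:
    """Cut a binary string at every occurrence of "10".

    Because the data starts with the CH "10", each resulting
    piece (a binary cluster) also starts with "10".
    """
    assert bits.startswith("10"), "data must begin with the CH '10'"
    cuts = [i for i in range(1, len(bits) - 1) if bits[i:i + 2] == "10"]
    bounds = [0] + cuts + [len(bits)]
    return [bits[a:b] for a, b in zip(bounds, bounds[1:])]

# The example above: "101110001110" -> 1011 / 100011 / 10
print(split_into_clusters("101110001110"))  # → ['1011', '100011', '10']
```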
In the same manner as described above, 876,042 bits of binary data to be compressed
100101000001001011000000000000000001000000000000000000000000000000000000000001000010000000011010100111100100011010001011011101101000000010000000000000000011110000000110000000000000 ....
is cut at each occurrence of "10". Each of these binary chunks is called a binary cluster. Note that every binary cluster generated in this way can be expressed as the binary string "10" followed by zero or more "0"s followed by zero or more "1"s.
100/10/100000/100/101/10000001/100000/100000/10/1000000000000000/10000000000000/100000000000000000000000000000/10000/100000000001/10/10/100111/100/1000/10000000/100000000000000000111/100000001/1000000000000000 ....
Step 2: Next, Table 1 below shows in sequence the result of dividing the total of 876,042 bits into binary cluster units; only the first 30 binary clusters are shown.
In the case of the binary data of this embodiment, the data is divided into 207,982 binary clusters.
Table 2 lists the distinct binary clusters of Table 1 together with the appearance frequency of each cluster; some entries are omitted for reasons of space.
Huffman coding is performed using the binary distribution table of Table 2 above.
Huffman coding is a form of entropy coding used in lossless compression. It is an algorithm that produces a prefix code (a code in which no symbol's codeword is a prefix of another symbol's codeword) from the symbol frequencies: less frequent symbols are allocated longer codes, and more frequent symbols are allocated shorter codes. In the present invention, an algorithm that generates a variable-length code from variable-length symbols is developed on top of this property of Huffman coding.
There are many ways to design Huffman codes. One example generates Huffman codes using a binary tree, in which each external node, or leaf node, represents a symbol. The Huffman code for a symbol is obtained by descending from the root node to the leaf node corresponding to that symbol and concatenating the bit assigned to each branch along the path.
In the figure, for symbol A, sequentially applying the bits assigned to the branches on the path descending from the root node through the inner nodes gives the Huffman code "0"; likewise "10" for symbol B, "110" for symbol C, and "111" for symbol D.
Among the methods of constructing a Huffman tree, the minimum-variance Huffman code is described here as an example. The two symbols with the lowest probability or frequency are combined; the combined node, whose weight is the sum of the two probabilities or frequencies, is re-sorted against the remaining symbols, and the code generation process is then repeated.
As an example, consider the minimum-variance Huffman code for the set S = {a, b, c, d, e} of five symbols, whose appearance counts are 2, 4, 2, 1, and 1 respectively. Of course, occurrence probabilities may be used instead of counts with the same calculation procedure.
In the present invention, the binary clusters that appear in the data immediately become the symbols a, b, c, d, e, and so on.
Construction of the minimum-variance Huffman code starts by combining the two symbols with the lowest probability or frequency of occurrence, d and e, and assigning a different bit as the last bit of the codeword of each. Here "1" is assigned to the side with the smaller probability or frequency and "0" to the side with the larger; when the two are equal, either assignment may be used. In this embodiment, when symbols are sorted in ascending order, "0" is assigned to the symbol placed higher.
Next, the two symbols (d, e) are combined to create a new symbol n1, whose occurrence frequency is the sum of the occurrences of d and e; the frequency of n1 is therefore 2. When the four symbols a, b, n1, and c are re-sorted by occurrence frequency, if an existing symbol has the same probability or frequency as the newly created symbol, the new symbol is placed above the existing symbol with the same probability or frequency. This differs from the regular Huffman code and the minimum-variance Huffman code.
The above process is shown below.
This process is summarized as follows: concatenating the bits along the path down to each leaf node gives the Huffman code for each symbol.
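The frequency-driven code construction can be sketched as below. This is a standard bottom-up Huffman merge, not the patent's exact procedure: the patent's minimum-variance reordering of ties is not reproduced, only its convention of prepending "1" on the less frequent side. The function name is my own.

```python
import heapq

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build a prefix code from symbol frequencies.

    Repeatedly merge the two lowest-frequency nodes. Following the
    convention described above, the smaller side's codewords get a
    "1" prepended and the larger side's a "0". Ties are broken by
    insertion order here, not by the minimum-variance rule.
    """
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    nxt = len(heap)
    while len(heap) > 1:
        f1, _, small = heapq.heappop(heap)   # lower frequency of the two
        f2, _, large = heapq.heappop(heap)
        merged = {s: "1" + c for s, c in small.items()}
        merged.update({s: "0" + c for s, c in large.items()})
        heapq.heappush(heap, (f1 + f2, nxt, merged))
        nxt += 1
    return heap[0][2]

# Frequencies from the example S = {a, b, c, d, e} above.
codes = huffman_codes({"a": 2, "b": 4, "c": 2, "d": 1, "e": 1})
```

The most frequent symbol, b, receives one of the shortest codewords; the exact bit patterns depend on how ties are broken, but the code-length multiset (2, 2, 2, 3, 3) matches what any optimal Huffman construction produces for these frequencies.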
Table 14 shows the result of generating the Huffman code by applying the actual binary cluster as a symbol to the above example.
The above Huffman code calculation uses a minimum-variance tree. Coding and decoding can, of course, equally be performed using any of the many known Huffman tree generation methods, such as the general Huffman tree or adaptive Huffman coding. In this embodiment, the minimum-variance tree method is used.
Table 3 below shows the Huffman code for each binary cluster in Table 2, based on the symbols and their frequencies. Here the Huffman coding scheme builds a minimum-variance tree and generates the Huffman codes from that code tree; only a part of the table is shown for reasons of space.
Of course, Huffman codes can be generated in a wide variety of other ways, such as with a maximum-variance tree or adaptive Huffman coding, though the compression ratio may differ somewhat.
Now, using the binary cluster-to-Huffman code conversion table of Table 3, compressed data is generated by replacing each binary cluster below with its Huffman code.
100/10/100000/100/101/10000001/100000/100000/10/1000000000000000/10000000000000/100000000000000000000000000000/10000/100000000001/10/10/100111/100/1000/10000000/100000000000000000111/100000001/1000000000000000 ....
The results are shown in Table 4 below. Moving from top to bottom, the final compressed data is generated by replacing each binary cluster according to the code conversion table of Table 3. The actual compressed data is physically stored as binary data in which the Huffman codes shown in Table 4 are concatenated one after another, in one-to-one correspondence with the order of appearance of the binary clusters.
That is, converting each binary cluster (symbol) one-to-one into its Huffman code according to the binary cluster-Huffman code table of Table 3, as shown in Table 4,
the original data consisting of binary cluster symbols,
100/10/100000/100/101/10000001/100000/100000/10/1000000000000000/10000000000000/100000000000000000000000000000/10000/100000000001/10/10/100111/100/1000/10000000/100000000000000000111/100000001/1000000000000000 ....
is compressed into the following form. In the following, "/" is a conceptual separator marking the Huffman code corresponding to each binary cluster symbol; it is not an actual character in the data.
101/01/000001/101/001/000000010/000001/000001/01/000000111101/000000110100/000000111110/11100/100111110010/01/01/000011/101/11101/01/1100/001/1000/001/01
/0001110/00001000010011010/0000001100/00001000011000/1101/..
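The substitution step can be sketched as follows. The small conversion table here is only a fragment, read off the first few cluster/code pairs of the two example streams above; it is not the full Table 3.

```python
# Fragment of the cluster -> Huffman code mapping, read off the
# first entries of the example streams (illustrative only).
table = {"100": "101", "10": "01", "100000": "000001", "101": "001"}

def encode(clusters: list[str], table: dict[str, str]) -> str:
    """Replace each binary cluster with its Huffman code, in order."""
    return "".join(table[c] for c in clusters)

print(encode(["100", "10", "100000", "100", "101"], table))
# → 10101000001101001
```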
The compression result can be calculated in advance as shown in Table 5; some parts are omitted for reasons of space.
For each binary cluster symbol, the difference between the length of the symbol and the length of its corresponding Huffman code is the compression gain per symbol (in this specification a negative value is also expressed as a gain). Multiplying this per-symbol gain by the appearance frequency of the symbol gives that symbol's contribution; computing this for every binary cluster symbol and summing the contributions gives the final total compression gain.
As shown in Table 5, the binary data to be compressed is 876,042 bits including the CH, and after compression it becomes 841,576 bits, a compression gain of 34,466 bits.
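The gain calculation just described can be sketched as below. The small frequency table is illustrative (not taken from Table 5); the one cluster/code pair is read off the example streams, where the 16-bit cluster "1000000000000000" maps to the 12-bit code "000000111101".

```python
def total_gain(freqs: dict[str, int], table: dict[str, str]) -> int:
    """Sum over all symbols of (symbol length - code length) * frequency.

    A negative result would mean the codes are longer overall than
    the symbols, i.e. a negative gain, as the specification allows.
    """
    return sum((len(s) - len(table[s])) * f for s, f in freqs.items())

freqs = {"1000000000000000": 5, "10": 3}
table = {"1000000000000000": "000000111101", "10": "01"}
print(total_gain(freqs, table))  # (16-12)*5 + (2-2)*3 = 20
```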
In addition to the compressed data, Huffman coding is accompanied by a code dictionary for decompression, as in the standard Huffman coding method. That is, dictionary information mapping each binary cluster symbol to its Huffman code can be added in front of the code, borrowing the known Huffman practice of prepending dictionary information. In the present invention the dictionary can be constructed in the following two ways.
A) First, to construct the coding dictionary, for each binary cluster the bit-length of the consecutive "0"s after the "10", the bit-length of the consecutive "1"s, and the appearance frequency are stored. The exact value of a binary cluster can be recovered from these two lengths, and from the symbol frequencies the Huffman code for each binary cluster symbol can be reproduced exactly by applying the same Huffman tree generation algorithm that was used at compression time.
That is, when the binary cluster and occurrence frequency information are stored as dictionary information for decompression, as shown in Table 6 below, the Huffman code for each binary cluster can also be obtained at the decompression stage by using the same Huffman tree construction method as in compression.
In particular, Table 6 can be expressed more compactly, as shown in Table 13. Because of how they are constructed, binary cluster symbols always start with "10". Clusters such as "10", "100", "1000", "10000", ... can be expressed simply by the number of consecutive "0"s after the "10"; clusters such as "101", "1011", "101111", ... can be expressed simply by the number of consecutive "1"s after the "10"; and clusters consisting of "10" followed by one or more consecutive "0"s and then one or more consecutive "1"s, such as "100011", "1000111", "10000001", ..., can be expressed by the number of consecutive "0"s and the number of consecutive "1"s after the "10". For example, "10" has zero consecutive "0"s and zero consecutive "1"s after the "10", and "101" has zero consecutive "0"s followed by a single consecutive "1".
As described above, the encoding dictionary can be represented more effectively by such a method that expresses each binary cluster symbol compactly.
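The compact form of method A can be sketched as follows. Every well-formed cluster is "10" plus z consecutive "0"s plus o consecutive "1"s, so the pair (z, o) identifies it exactly. The function names are illustrative.

```python
def cluster_to_counts(cluster: str) -> tuple[int, int]:
    """Encode a binary cluster as its (zeros, ones) counts after "10"."""
    assert cluster.startswith("10")
    body = cluster[2:]
    zeros = len(body) - len(body.lstrip("0"))
    ones = len(body) - zeros
    # Every well-formed cluster is "10", then 0s, then 1s.
    assert body == "0" * zeros + "1" * ones
    return zeros, ones

def counts_to_cluster(zeros: int, ones: int) -> str:
    """Inverse mapping: rebuild the cluster from its two counts."""
    return "10" + "0" * zeros + "1" * ones

print(cluster_to_counts("1000111"))  # → (2, 3)
print(counts_to_cluster(0, 1))       # → 101
```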
Also, for the appearance frequencies, the occurrence count of each distinct appearance-frequency value in the second column of Table 6 is calculated separately, as shown in Table 7, and these appearance-frequency-value symbols are themselves Huffman coded on the basis of Table 7. In this way, the appearance-frequency information for each symbol in the dictionary can itself be compressed.
B) In the second method of constructing a dictionary for decompression, the binary cluster, the Huffman code, and the length of the Huffman code are stored, as shown in Table 8 below. This differs from method A in that the Huffman codes are not reproduced during decoding but are stored directly in the encoding dictionary. The binary cluster symbols may be represented compactly as in Table 13 of method A, and the Huffman codes of Table 8 may be stored consecutively in the encoding dictionary; using the stored Huffman code lengths, the concatenated codes can be split into the individual symbol codes, and the binary cluster symbol corresponding to each Huffman code can be sequentially recovered from the dictionary information.
We will now briefly explain the decompression process.
The decompression process starts from the encoding dictionary for decompression, the compressed data, and the CH binary string, configured as described above.
First, the binary cluster symbol and Huffman code dictionary, as in Table 6 or Table 8, is reproduced from the decompression dictionary in the form shown in Table 9.
Using the dictionary information shown in Table 9 above, in the following compressed data composed of continuous Huffman codes,
101/01/000001/101/001/000000010/000001/000001/01/000000111101/000000110100/000000111110/11100/100111110010/01/01/000011/101/11101/01/1100/001/1000/001/01
/0001110/00001000010011010/0000001100/00001000011000/1101/..
the binary cluster symbol corresponding to each Huffman code is recovered in sequence. The restoration result is the same as Table 1; physically it looks like this:
100/10/100000/100/101/10000001/100000/100000/10/1000000000000000/10000000000000/100000000000000000000000000000/10000/100000000001/10/10/100111/100/1000/10000000/100000000000000000111/100000001/1000000000000000 ....
Now, stripping the conceptual separators "/" and viewing the result as a single stream, it looks like this:
100101000001001011000000000000000001000000000000000000000000000000000000000001000010000000011010100111100100011010001011011101101000000010000000000000000011110000000110000000000000 ....
Finally, since the CH information "10" added at compression time is known, removing the CH from the most significant bit yields the original 876,040-bit binary data, decompressed exactly as it was before compression:
0101000001001011000000000000000000100000000000000000000000000000000000000000100001000000001101010011110010001101000101101110110100000001000000000000000001111000000011000000000000000 ....
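The decompression walk can be sketched as below: invert the dictionary, consume the stream bit by bit (the prefix property guarantees the first match is the whole codeword), then strip the CH. The tiny table is an illustrative fragment read off the example streams, not the full dictionary; function names are my own.

```python
def decode_clusters(stream: str, table: dict[str, str]) -> list[str]:
    """Split a concatenated Huffman stream back into binary clusters."""
    inverse = {code: sym for sym, code in table.items()}
    out, buf = [], ""
    for bit in stream:
        buf += bit
        if buf in inverse:          # prefix property: match is unambiguous
            out.append(inverse[buf])
            buf = ""
    assert buf == "", "stream ended in the middle of a codeword"
    return out

def decompress(stream: str, table: dict[str, str], ch: str = "10") -> str:
    """Decode the clusters, rejoin them, and drop the compulsory header."""
    data = "".join(decode_clusters(stream, table))
    assert data.startswith(ch)
    return data[len(ch):]

table = {"100": "101", "10": "01", "100000": "000001"}
print(decompress("10101000001", table))  # → 010100000
```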
To compare the effect of the present invention, the same original binary data was divided into fixed-length symbols and Huffman coded according to the appearance frequency of those symbols. The fixed length chosen is 8 bits, one of the most commonly used; of course, other fixed lengths and various other schemes could also be used for compression.
Table 10 shows the results of dividing the same data into 8-bit fixed lengths.
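The fixed-length baseline's split step can be sketched as follows; this is a simple illustration, and with the patent's 876,040-bit data and n = 8 it would yield 109,505 symbols.

```python
def split_fixed(bits: str, n: int = 8) -> list[str]:
    """Split a binary string into fixed-length n-bit symbols."""
    assert len(bits) % n == 0  # the 876,040-bit example divides evenly by 8
    return [bits[i:i + n] for i in range(0, len(bits), n)]

print(split_fixed("0101000001001011"))  # → ['01010000', '01001011']
```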
Table 11 shows the result of Huffman coding according to the appearance frequencies of the 8-bit fixed-length symbols. The Huffman coding algorithm used here is the same minimum-variance tree algorithm applied in the present invention.
As can be seen in Table 11, the 876,040 bits of binary data are compressed into 863,345 bits. (The original data here is 2 bits smaller because the fixed-length scheme does not require prepending the "10" CH.) This result is not as good as the present invention's 841,576 bits. Although a dictionary for decompression was not included in either count, the present invention obtains a reduction of about 22,000 bits, and this trend is expected to hold, with the present invention outperforming the fixed-length Huffman algorithm, across data types. Moreover, the symbol set of Table 11 contains 256 symbols, "00000000" to "11111111" (0 to 255), so its complexity is large, while the number of binary cluster symbols of the present invention is only 231, each of a form that can be expressed simply; the present invention is therefore expected to be still more effective once the dictionary information for decompression is included in the calculation.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150022947A KR20160100496A (en) | 2015-02-15 | 2015-02-15 | Improved huffman code method and apprartus thereof by using binary clusters |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20160100496A true KR20160100496A (en) | 2016-08-24 |
Family
ID=56884030
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150022947A KR20160100496A (en) | 2015-02-15 | 2015-02-15 | Improved huffman code method and apprartus thereof by using binary clusters |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20160100496A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114885033A (en) * | 2022-04-26 | 2022-08-09 | 青岛鼎信通讯股份有限公司 | Data frame compression method based on power line communication protocol |
CN116108506A (en) * | 2023-04-12 | 2023-05-12 | 广东奥飞数据科技股份有限公司 | Meta-universe digital asset security management system |
CN116108506B (en) * | 2023-04-12 | 2023-06-23 | 广东奥飞数据科技股份有限公司 | Meta-universe digital asset security management system |
CN116318174A (en) * | 2023-05-15 | 2023-06-23 | 青岛国源中创电气自动化工程有限公司 | Data management method of garbage transportation management system of sewage treatment plant |
CN116318174B (en) * | 2023-05-15 | 2023-08-15 | 青岛国源中创电气自动化工程有限公司 | Data management method of garbage transportation management system of sewage treatment plant |