KR20180004409A

KR20180004409A - Universal real-time lossless data compression method for binary data encoded in UTF-8

Info

Publication number: KR20180004409A
Application number: KR1020160083901A
Authority: KR
Inventors: 김정훈
Original assignee: 김정훈
Priority date: 2016-07-04
Filing date: 2016-07-04
Publication date: 2018-01-12

Abstract

The present invention provides a universal compression method for UTF-8 encoded text. UTF-8, created by Ken Thompson and Rob Pike, is one of the variable-length character encoding schemes for Unicode. The name is an abbreviation of Universal Coded Character Set + Transformation Format 8-bit, and the scheme was originally proposed under the name File System Safe UCS/Unicode Transformation Format (FSS-UTF). UTF-8 uses 1 to 4 bytes to represent one Unicode character. It is defined in different ways in various standards documents, but the general structure is the same in all of them. The bits of a Unicode code point are divided into several parts and placed in the lower bits of the bytes of the UTF-8 representation. Characters up to U+007F are represented in the same way as 7-bit ASCII characters, and subsequent characters are represented by bit patterns of up to 4 bytes. Every byte of such a multi-byte pattern has its most significant bit set to 1 so that it cannot be confused with a 7-bit ASCII character. As a result, the method achieves high compression efficiency in non-English-speaking countries such as Korea, Japan, and China, where native-language characters make up the overwhelming share of communications in a multilingual system, while English text is left uncompressed, so the data does not grow.

Description

Technical Field [0001] The present invention relates to a real-time lossless compression method for binary data encoded in the general-purpose UTF-8 format.

UTF-8 encoding, lossless data compression, non-entropy data compression

Described in detail in the detailed description for carrying out the invention.

UTF-8 is one of the variable-length character encoding schemes for Unicode and was created by Ken Thompson and Rob Pike. UTF-8 stands for Universal Coded Character Set + Transformation Format 8-bit. It was originally proposed under the name FSS-UTF (File System Safe UCS/Unicode Transformation Format).

UTF-8 encoding uses 1 to 4 bytes to represent one Unicode character.

UTF-8 is defined in different ways in several standards documents, but the general structure is the same in all of them.

The bits representing a Unicode code point are divided into several parts and placed in the lower bits of the bytes of the UTF-8 representation. Characters up to U+007F are represented in the same way as 7-bit ASCII characters, and characters beyond that are represented by the following bit patterns of up to 4 bytes. So that they are not confused with 7-bit ASCII characters, the most significant bit of every byte in these multi-byte patterns is 1.

As a result, high compression efficiency is achieved in non-English-speaking countries such as Korea, Japan, and China, where native-language characters account for the overwhelming share of communications in a multilingual system, and even English text does not grow, because it is simply left uncompressed. The present invention thus provides a universal compression method for UTF-8 encoded text.

Table 1 below shows the UTF-8 encoding scheme.

Bits of code point | First code point | Last code point | Bytes in sequence | UTF-8 code (Byte 1 through Byte 6, left to right)
7 | U+0000 | U+007F | 1 | 0xxxxxxx
11 | U+0080 | U+07FF | 2 | 110xxxxx 10xxxxxx
16 | U+0800 | U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx
21 | U+10000 | U+1FFFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The following sequences are not part of the UTF-8 standard, only part of the original proposal:
26 | U+200000 | U+3FFFFFF | 5 | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
31 | U+4000000 | U+7FFFFFFF | 6 | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
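As an illustration of Table 1, the following minimal Python sketch prints the UTF-8 bit pattern of a character and reads the sequence length from the prefix of the lead byte. The function names `utf8_bit_pattern` and `sequence_length` are hypothetical and do not appear in the application. Only the standard 1 to 4 byte forms are handled.

```python
def utf8_bit_pattern(ch: str) -> list[str]:
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


def sequence_length(lead_byte: int) -> int:
    """Read the sequence length from the lead byte prefix (Table 1, rows 1 to 4)."""
    if lead_byte & 0b1000_0000 == 0:
        return 1  # 0xxxxxxx
    if lead_byte & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx
    if lead_byte & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx
    if lead_byte & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx
    raise ValueError("not a standard UTF-8 lead byte")


print(utf8_bit_pattern("A"))   # ['01000001']
print(utf8_bit_pattern("한"))  # ['11101101', '10010101', '10011100']
```

The Hangul syllable in the example occupies three bytes, which is the case the application targets for Korean text.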

The compression algorithm proposed in the present invention handles the cases in which Byte 1 begins with a 1 bit, namely the prefixes 110, 1110, 11110, 111110, and 1111110, by replacing them with 11, 101, 1001, 10001, and 100001, respectively, and by removing the leading bits "10" from each of the subsequent bytes, Byte 2 through Byte 6. The compressed intermediate code generated in this way is shown in Table 2 below.

If the most significant bit of Byte 1 is 0, no compression operation is performed.

Bits of code point | First code point | Last code point | Bytes in sequence | Final compression code (compressed Byte 1 through Byte 6, left to right; spaces mark the original byte boundaries only)
7 | U+0000 | U+007F | 1 | 0xxxxxxx
11 | U+0080 | U+07FF | 2 | 11xxxxx xxxxxx
16 | U+0800 | U+FFFF | 3 | 101xxxx xxxxxx xxxxxx
21 | U+10000 | U+1FFFFF | 4 | 1001xxx xxxxxx xxxxxx xxxxxx
The following sequences are not part of the UTF-8 standard, only part of the original proposal:
26 | U+200000 | U+3FFFFFF | 5 | 10001xx xxxxxx xxxxxx xxxxxx xxxxxx
31 | U+4000000 | U+7FFFFFFF | 6 | 100001x xxxxxx xxxxxx xxxxxx xxxxxx xxxxxx
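The bit counts implied by Table 2 can be checked with a short calculation. The sketch below is illustrative only, and the helper name `compressed_bits` is hypothetical. It assumes the header of an n-byte sequence shrinks from n + 1 bits to n bits and that each continuation byte loses its 2-bit "10" prefix, so an n-byte sequence (n of 2 or more) shrinks from 8n bits to 6n + 1 bits.

```python
def compressed_bits(n_bytes: int) -> int:
    """Bits used by one character after compression, per Table 2."""
    if n_bytes == 1:
        return 8  # ASCII bytes are copied unchanged
    header = n_bytes                    # "1", then (n - 2) zeros, then "1"
    lead_payload = 8 - (n_bytes + 1)    # x bits left in Byte 1
    continuation = 6 * (n_bytes - 1)    # 6 payload bits per following byte
    return header + lead_payload + continuation


for n in range(1, 7):
    original = 8 * n
    packed = compressed_bits(n)
    saving = 1 - packed / original
    print(f"{n}-byte sequence: {original} bits -> {packed} bits ({saving:.1%} saved)")
```

For a three-byte character, which covers Hangul and most CJK ideographs, this gives 19 bits instead of 24.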

The compressed code of a multi-byte character also begins with a most significant bit of "1".
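The compression step described above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation. The name `compress_bits` is hypothetical, and the output is kept as a string of '0' and '1' characters because the application does not specify how the bit stream is packed into bytes or padded.

```python
def compress_bits(data: bytes) -> str:
    """Compress UTF-8 bytes into a '0'/'1' string following Table 2."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Byte 1 starts with 0: copy the byte unchanged.
            out.append(format(lead, "08b"))
            i += 1
            continue
        bits = format(lead, "08b")
        n = bits.index("0")  # count of leading 1 bits = sequence length
        if n < 2 or n > 6 or i + n > len(data):
            raise ValueError(f"invalid or truncated sequence at offset {i}")
        out.append("1" + "0" * (n - 2) + "1")  # 110 -> 11, 1110 -> 101, ...
        out.append(bits[n + 1:])               # payload bits of Byte 1
        for b in data[i + 1:i + n]:
            if b & 0xC0 != 0x80:
                raise ValueError("expected a continuation byte (10xxxxxx)")
            out.append(format(b & 0x3F, "06b"))  # drop the leading "10"
        i += n
    return "".join(out)


print(compress_bits("A".encode("utf-8")))   # 01000001
print(compress_bits("한".encode("utf-8")))  # 1011101010101011100
```

A real-time implementation would more likely write into a bit buffer than build strings. The string form is used here only to keep each bit visible.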

The compressed code is decoded as follows.

[1] First read the most significant bit. If it is "0", read the following 7 bits and take the resulting 8 bits unchanged as the UTF-8 code.

[2] If it is "1", read on until the next "1" is encountered and split off those bits. The result is one of "11", "101", "1001", "10001", or "100001". Restoring each of these compressed headers to the Byte 1 header used before compression yields "110", "1110", "11110", "111110", or "1111110", respectively.

Once a header has been restored, the number of further bits to read for Byte 1 is known. For example, if "110" has been restored as the header of Byte 1, a further 5 bits are read according to Table 1 to complete Byte 1. One more byte must then be restored, so a further 6 bits are read from the compressed data and "10" is prepended to them to restore Byte 2.

If "1110" has been restored as the header of Byte 1, a further 4 bits are read according to Table 1 to restore Byte 1. Two more bytes must then be restored, so 6 bits are read from the compressed data twice, and "10" is prepended to each group of 6 bits to restore Byte 2 and Byte 3. The UTF-8 code is restored in this manner.

Repeating the above process from the bit that follows the restored sequence, up to the end of the compressed data, fully decompresses the UTF-8 code.
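The decoding procedure can be sketched in Python as follows, again as an illustration rather than the applicant's implementation. The name `decompress_bits` is hypothetical. The input is assumed to be a string of '0' and '1' characters holding whole compressed characters with no padding, since the application does not describe byte packing.

```python
def decompress_bits(bits: str) -> bytes:
    """Restore UTF-8 bytes from a '0'/'1' string produced per Table 2."""
    out = bytearray()
    pos = 0

    def take(count: int) -> str:
        """Consume `count` bits, failing loudly on truncated input."""
        nonlocal pos
        if pos + count > len(bits):
            raise ValueError("compressed data ends in the middle of a character")
        chunk = bits[pos:pos + count]
        pos += count
        return chunk

    while pos < len(bits):
        if bits[pos] == "0":
            # Step [1]: a leading 0 means the 8 bits are a plain ASCII byte.
            out.append(int(take(8), 2))
            continue
        # Step [2]: the header runs from this 1 to the next 1, inclusive.
        end = bits.index("1", pos + 1)
        n = end - pos + 1          # header length equals the sequence length
        if n > 6:
            raise ValueError("header longer than any entry in Table 2")
        pos = end + 1
        # Restore Byte 1: n ones, a zero, then the remaining payload bits.
        out.append(int("1" * n + "0" + take(7 - n), 2))
        # Restore each following byte by prepending "10" to 6 payload bits.
        for _ in range(n - 1):
            out.append(int("10" + take(6), 2))
    return bytes(out)


packed = "01000001" + "1011101010101011100"  # "A" followed by "한"
print(decompress_bits(packed).decode("utf-8"))  # A한
```

Because the headers "11", "101", "1001", "10001", and "100001" form a prefix-free set, scanning for the next "1" is enough to identify the sequence length without lookahead.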

Claims

As this is a prior application filed as the basis for a domestic priority claim, no separate claims are specified.