KR20180004409A - Universal real-time lossless data compression method of binary data encoded by utf-8 - Google Patents

Universal real-time lossless data compression method of binary data encoded by utf-8

Info

Publication number
KR20180004409A
Authority
KR
South Korea
Prior art keywords
utf
unicode
bit
bytes
bits
Prior art date
Application number
KR1020160083901A
Other languages
Korean (ko)
Inventor
김정훈
Original Assignee
김정훈
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 김정훈
Priority to KR1020160083901A
Publication of KR20180004409A

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60 General implementation details not specific to a particular type of compression
    • H03M7/6017 Methods or arrangements to increase the throughput
    • H03M7/70 Type of the data to be coded, other than image and sound
    • H03M7/705 Unicode

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present invention provides a universal compression method for UTF-8 encoded text. UTF-8, devised by Ken Thompson and Rob Pike, is one of the variable-length character encoding schemes for Unicode. The name is an abbreviation of Universal Coded Character Set + Transformation Format 8-bit, and the scheme was originally proposed under the name File System Safe UCS/Unicode Transformation Format (FSS-UTF). UTF-8 uses 1 to 4 bytes to represent one Unicode character. UTF-8 is defined in different ways in various standard documents, but the general structure is the same in all of them. The bits of a Unicode code point are divided into several parts and placed in the lower bits of the bytes of the UTF-8 sequence. Characters up to U+007F are represented in the same way as 7-bit ASCII characters, and subsequent characters are represented by bit patterns of up to 4 bytes. In these multi-byte sequences the most significant bit of every byte is 1, so that they are not confused with 7-bit ASCII characters. As a result, the method achieves high compression efficiency in non-English-speaking countries such as Korea, Japan, and China, where native-language characters overwhelmingly dominate communications in a multilingual system, while for English text no compression is performed and the data does not grow.

Description

UNIVERSAL REAL-TIME LOSSLESS DATA COMPRESSION METHOD OF BINARY DATA ENCODED BY UTF-8

Technical Field [0001] The present invention relates to a real-time lossless compression method for binary data encoded in the general UTF-8 format.

UTF-8 encoding, lossless data compression, non-entropy data compression

Described in detail in the detailed description for carrying out the invention.

UTF-8 is one of the variable-length character encoding schemes for Unicode and was created by Ken Thompson and Rob Pike. UTF-8 stands for Universal Coded Character Set + Transformation Format 8-bit. It was originally proposed under the name FSS-UTF (File System Safe UCS/Unicode Transformation Format).

UTF-8 encoding uses 1 to 4 bytes to represent one Unicode character.

UTF-8 is defined in different ways in several standard documents, but the general structure is the same in all of them.

The bits representing a Unicode code point are divided into several parts and placed in the lower bits of the bytes of the UTF-8 sequence. Characters up to U+007F are represented in the same way as 7-bit ASCII characters, and characters beyond that are represented by the bit patterns of up to 4 bytes shown below. So that they are not confused with 7-bit ASCII characters, every byte of a multi-byte sequence has its most significant bit set to 1.

As a result, the method shows high compression efficiency in non-English-speaking countries such as Korea, Japan, and China, where native-language characters make up the overwhelming majority of communication traffic in a multilingual system, and even for English text the data does not grow, because no compression is applied. The present invention therefore provides a universal compression method for UTF-8 encoded text.
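To make the efficiency claim concrete, the per-character bit counts implied by the scheme described in this document can be tabulated. The following is a minimal Python sketch, not part of the patent text; the function name `compressed_bits` is hypothetical. It assumes that the lead-byte prefix loses one bit and that each continuation byte loses its two-bit "10" marker, so an n-byte sequence shrinks by 2n - 1 bits, while single-byte (ASCII) sequences stay at 8 bits.

```python
def compressed_bits(n_bytes: int) -> int:
    """Bits needed for one character after compression (hypothetical helper).

    Assumption: the lead-byte prefix is shortened by 1 bit and each of the
    (n_bytes - 1) continuation bytes drops its 2-bit "10" marker.
    """
    if not 1 <= n_bytes <= 6:
        raise ValueError("sequence length must be between 1 and 6 bytes")
    if n_bytes == 1:
        # ASCII bytes are copied unchanged.
        return 8
    return 8 * n_bytes - (2 * n_bytes - 1)


if __name__ == "__main__":
    for n in range(1, 7):
        before = 8 * n
        after = compressed_bits(n)
        saving = 100 * (before - after) / before
        print(f"{n}-byte sequence: {before} bits -> {after} bits ({saving:.1f}% saved)")
```

For the 3-byte sequences that UTF-8 uses for Hangul and most CJK characters, this gives 19 bits instead of 24, a saving of roughly 21% per character, while ASCII text is unchanged.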

Table 1 below shows the UTF-8 code scheme.

| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | UTF-8 code |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | | | | 0xxxxxxx |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | | | | 110xxxxx10xxxxxx |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | | | | 1110xxxx10xxxxxx10xxxxxx |
| 21 | U+10000 | U+1FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | | 11110xxx10xxxxxx10xxxxxx10xxxxxx |
| 26 | U+200000 | U+3FFFFFF | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | | 111110xx10xxxxxx10xxxxxx10xxxxxx10xxxxxx |
| 31 | U+4000000 | U+7FFFFFFF | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 1111110x10xxxxxx10xxxxxx10xxxxxx10xxxxxx10xxxxxx |

Note: the 5-byte and 6-byte sequences are not part of the UTF-8 standard, only part of the original proposal.
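The Byte 1 patterns in Table 1 determine how many bytes belong to a sequence. The following Python sketch illustrates that lookup; it is an illustration only, and the names `LEAD_PATTERNS` and `sequence_length` are hypothetical rather than taken from the patent. It includes the 5-byte and 6-byte patterns from the original proposal because Table 1 lists them.

```python
# (mask, expected value after masking, bytes in sequence), following Table 1.
LEAD_PATTERNS = [
    (0b10000000, 0b00000000, 1),  # 0xxxxxxx
    (0b11100000, 0b11000000, 2),  # 110xxxxx
    (0b11110000, 0b11100000, 3),  # 1110xxxx
    (0b11111000, 0b11110000, 4),  # 11110xxx
    (0b11111100, 0b11111000, 5),  # 111110xx (original proposal only)
    (0b11111110, 0b11111100, 6),  # 1111110x (original proposal only)
]


def sequence_length(lead_byte: int) -> int:
    """Return the number of bytes in the sequence that starts with lead_byte."""
    for mask, pattern, length in LEAD_PATTERNS:
        if (lead_byte & mask) == pattern:
            return length
    # 10xxxxxx is a continuation byte; 0xFE and 0xFF match no pattern.
    raise ValueError(f"not a valid lead byte: {lead_byte:#04x}")


if __name__ == "__main__":
    for byte in (0x41, 0xC3, 0xED, 0xF0):
        print(f"{byte:#04x} starts a {sequence_length(byte)}-byte sequence")
```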

For the cases where the most significant bit of Byte 1 is 1, the compression algorithm proposed in the present invention replaces the Byte 1 prefixes 110, 1110, 11110, 111110, and 1111110 with 11, 101, 1001, 10001, and 100001, respectively, and removes the leading bits "10" from each of the following bytes, Byte 2 through Byte 6. The resulting compressed intermediate code is shown in Table 2 below.

If the most significant bit of Byte 1 is 0, no compression operation is performed.

| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 | Final compression code |
|---|---|---|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | | | | 0xxxxxxx |
| 11 | U+0080 | U+07FF | 2 | 11xxxxx | xxxxxx | | | | | 11xxxxxxxxxxx |
| 16 | U+0800 | U+FFFF | 3 | 101xxxx | xxxxxx | xxxxxx | | | | 101xxxxxxxxxxxxxxxx |
| 21 | U+10000 | U+1FFFFF | 4 | 1001xxx | xxxxxx | xxxxxx | xxxxxx | | | 1001xxxxxxxxxxxxxxxxxxxxx |
| 26 | U+200000 | U+3FFFFFF | 5 | 10001xx | xxxxxx | xxxxxx | xxxxxx | xxxxxx | | 10001xxxxxxxxxxxxxxxxxxxxxxxxxx |
| 31 | U+4000000 | U+7FFFFFFF | 6 | 100001x | xxxxxx | xxxxxx | xxxxxx | xxxxxx | xxxxxx | 100001xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |

Note: the 5-byte and 6-byte sequences are not part of the UTF-8 standard, only part of the original proposal.
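The mapping in Table 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `PREFIX_MAP` and `compress_utf8` are hypothetical, and the compressed stream is returned as a string of "0" and "1" characters, because the text does not say how the bit stream is packed into bytes or padded at the end.

```python
# Original Byte 1 prefix -> shortened prefix, as listed in Table 2.
PREFIX_MAP = {
    "110": "11",
    "1110": "101",
    "11110": "1001",
    "111110": "10001",
    "1111110": "100001",
}


def compress_utf8(data: bytes) -> str:
    """Compress UTF-8 bytes into a bit string (hypothetical helper)."""
    out = []
    i = 0
    while i < len(data):
        lead = format(data[i], "08b")
        if lead[0] == "0":
            # Single-byte (ASCII) sequence: copied unchanged.
            out.append(lead)
            i += 1
            continue
        # The number of leading 1 bits equals the sequence length in bytes.
        ones = len(lead) - len(lead.lstrip("1"))
        prefix = lead[:ones + 1]
        if prefix not in PREFIX_MAP:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + ones > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # Shortened prefix followed by the payload bits of Byte 1.
        out.append(PREFIX_MAP[prefix] + lead[ones + 1:])
        for byte in data[i + 1:i + ones]:
            cont = format(byte, "08b")
            if not cont.startswith("10"):
                raise ValueError(f"invalid continuation byte in sequence at offset {i}")
            # Drop the "10" marker and keep the 6 payload bits.
            out.append(cont[2:])
        i += ones
    return "".join(out)


if __name__ == "__main__":
    text = "한글 ABC"
    encoded = text.encode("utf-8")
    bits = compress_utf8(encoded)
    print(f"{len(encoded) * 8} bits before, {len(bits)} bits after")
```

A string is used for the bit stream only to keep the sketch readable; a practical version would pack the bits into bytes and record the total bit length.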

The compressed code of a multi-byte sequence also starts with a most significant bit of "1".

The compressed code is decoded as follows.

[1] First, read the most significant bit. If it is "0", read the following 7 bits and use the resulting 8 bits unchanged as the UTF-8 code.

[2] If it is "1", separate off the bits up to and including the next "1". The result is one of "11", "101", "1001", "10001", or "100001", and restoring these compressed headers to the Byte 1 headers they had before compression gives "110", "1110", "11110", "111110", and "1111110", respectively.

Once the header is restored, the number of bits that still have to be read for Byte 1 is known. For example, if "110" is restored as the header of Byte 1, then according to Table 1 a further 5 bits are read to complete Byte 1. One more byte then has to be restored, so a further 6 bits are read from the compressed data and "10" is prepended as the most significant bits to restore Byte 2.

If "1110" is restored as the header of Byte 1, then according to Table 1 a further 4 bits are read to restore Byte 1. Two more bytes then have to be restored, so 6 bits are read from the compressed data twice, and "10" is prepended to each 6-bit group to restore Byte 2 and Byte 3. The UTF-8 code is restored in this manner.

Repeating this process from the bit that follows the restored sequence until the end of the compressed data decompresses the UTF-8 code completely.
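The decoding steps above can be sketched in Python as follows. This is an illustration only: the name `decompress_to_utf8` is hypothetical, and the input is assumed to be a string of "0" and "1" characters holding exactly the compressed bits, with no padding, since the text does not describe how the bit stream is packed into bytes.

```python
def decompress_to_utf8(bits: str) -> bytes:
    """Restore UTF-8 bytes from a compressed bit string (hypothetical helper)."""
    out = bytearray()
    pos = 0

    def take(n: int) -> str:
        """Read the next n bits, failing if the stream ends early."""
        nonlocal pos
        chunk = bits[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("truncated compressed stream")
        pos += n
        return chunk

    while pos < len(bits):
        if take(1) == "0":
            # Step [1]: a leading 0 means an unchanged 7-bit ASCII byte.
            out.append(int(take(7), 2))
            continue
        # Step [2]: count the zeros up to the next 1 to identify the header
        # ("11" has 0 zeros, "101" has 1, ..., "100001" has 4).
        zeros = 0
        while take(1) == "0":
            zeros += 1
            if zeros > 4:
                raise ValueError("invalid compressed header")
        n_bytes = zeros + 2
        # Restore the Byte 1 header (n_bytes ones and a zero), then read the
        # remaining payload bits of Byte 1.
        out.append(int("1" * n_bytes + "0" + take(7 - n_bytes), 2))
        # Restore each following byte by prepending "10" to 6 payload bits.
        for _ in range(n_bytes - 1):
            out.append(int("10" + take(6), 2))
    return bytes(out)


if __name__ == "__main__":
    restored = decompress_to_utf8("01000001" + "1011101010101011100")
    print(restored.decode("utf-8"))
```

Because the headers "0", "11", "101", "1001", "10001", and "100001" form a prefix-free set, the decoder never needs to look ahead beyond the header it is currently reading.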

Claims (1)

As this is an earlier application filed for the purpose of a domestic priority claim, no separate claims are stated.
KR1020160083901A 2016-07-04 2016-07-04 Universal real-time lossless data compression method of binary data encoded by utf-8 KR20180004409A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160083901A KR20180004409A (en) 2016-07-04 2016-07-04 Universal real-time lossless data compression method of binary data encoded by utf-8

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160083901A KR20180004409A (en) 2016-07-04 2016-07-04 Universal real-time lossless data compression method of binary data encoded by utf-8

Publications (1)

Publication Number Publication Date
KR20180004409A (en) 2018-01-12

Family

ID=61000866

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160083901A KR20180004409A (en) 2016-07-04 2016-07-04 Universal real-time lossless data compression method of binary data encoded by utf-8

Country Status (1)

Country Link
KR (1) KR20180004409A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113283215A (en) * 2021-07-15 2021-08-20 北京华云安信息技术有限公司 Data confusion method and device based on UTF-32 coding

Similar Documents

Publication Publication Date Title
KR101610609B1 (en) Data encoder, data decoder and method
US8933826B2 (en) Encoder apparatus, decoder apparatus and method
CN108737976A (en) A kind of compression transmitting method based on Big Dipper short message
CN104467868A (en) Chinese text compression method
CN103347047A (en) Lossless data compression method based on online dictionaries
KR101667240B1 (en) Secure and lossless data compression
KR20180004409A (en) Universal real-time lossless data compression method of binary data encoded by utf-8
KR20180004410A (en) Universal real-time lossless data compression method of binary data encoded by utf-8
KR101791877B1 (en) Method and apparatus for compressing utf-8 code character
KR20180047738A (en) UTF-8 character set compression method and apparatus thereof
KR101752281B1 (en) Method and apparatus for compressing utf-8 code character
KR101791880B1 (en) Method and apparatus for compressing utf-8 code character
RU2437148C1 (en) Method to compress and to restore messages in systems of text information processing, transfer and storage
EP2113845A1 (en) Character conversion method and apparatus
KR20180006011A (en) HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof
KR20180006605A (en) HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof
KR20180005386A (en) Utf-8 code real-time lossless compression by using universal code method and appratus thereof
KR20180008034A (en) HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof
KR20160049627A (en) Enhancement of data compression rate by efficient mapping binary cluster with universal code based on frequency of binary cluster
KR20180008226A (en) HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof
KR20180009060A (en) HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof
KR20190091586A (en) TCP/IP Packet data compression method and appratus based on binary compression method
KR20180007740A (en) HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof
KR20180009088A (en) HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof
KR101573983B1 (en) Method of data compressing, method of data recovering, and the apparatuses thereof