KR20180004409A - Universal real-time lossless data compression method of binary data encoded by utf-8 - Google Patents
- Publication number
- KR20180004409A
- Authority
- KR
- South Korea
- Prior art keywords
- utf
- unicode
- bit
- bytes
- bits
Classifications
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/60—General implementation details not specific to a particular type of compression
- H03M7/6017—Methods or arrangements to increase the throughput
- H03M7/70—Type of the data to be coded, other than image and sound
- H03M7/705—Unicode
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
UTF-8 encoding, lossless data compression, non-entropy data compression
Described in detail in the specific embodiments for carrying out the invention.
UTF-8 is one of the variable-length character encodings for Unicode and was created by Ken Thompson and Rob Pike. UTF-8 stands for Universal Coded Character Set + Transformation Format 8-bit. It was originally proposed under the name FSS-UTF (File System Safe UCS/Unicode Transformation Format).
UTF-8 encoding uses from 1 to 4 bytes to represent a single Unicode character.
UTF-8 is defined in different ways in several standards documents, but the general structure is the same in all of them.
The bits representing a Unicode code point are split into several parts and placed in the low-order bits of the UTF-8 bytes. Characters up to U+007F are represented in the same way as 7-bit ASCII characters, and characters beyond that are represented by bit patterns of up to 4 bytes, as shown below. To avoid confusion with 7-bit ASCII characters, the most significant bit of every byte in such a multi-byte sequence is 1.
As a result, in a multilingual environment, the method achieves high compression efficiency for non-English-speaking countries such as Korea, Japan, and China, where native-language characters make up the overwhelming share of communication traffic. For English text, no compression is applied, so the data does not grow. The present invention therefore provides a universal compression method for UTF-8 encoded text.
Table 1 below shows the UTF-8 code scheme.
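As an illustration of the byte layout this scheme refers to, the following minimal Python sketch prints the UTF-8 bytes of a few characters in binary. The helper name `utf8_bits` and the sample characters are assumptions chosen for illustration; they do not appear in the patent text. Python's built-in encoder emits sequences of at most 4 bytes.

```python
def utf8_bits(ch: str) -> list[str]:
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


if __name__ == "__main__":
    # An ASCII letter, an accented Latin letter, a Hangul syllable, an emoji.
    for ch in ("A", "\u00e9", "\ud55c", "\U0001F600"):
        print(f"U+{ord(ch):04X}", " ".join(utf8_bits(ch)))
```

In the output, single-byte codes start with 0, the first byte of a multi-byte sequence starts with 110, 1110, or 11110, and every following byte starts with 10.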
For the cases in which the most significant bit of Byte 1 is 1, that is, the headers 110, 1110, 11110, 111110, and 1111110, the compression algorithm proposed in the present invention replaces them with 11, 101, 1001, 10001, and 100001, respectively, and removes the leading bits "10" from each subsequent byte (Byte 2 through Byte 6). The resulting compressed intermediate code is shown in Table 2 below.
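As a rough check of the savings implied by these substitutions, the sketch below counts the bits used by one compressed multi-byte sequence. The function name `compressed_bits` is hypothetical, and the arithmetic assumes the standard UTF-8 layout in which the first byte of an n-byte sequence carries 7 - n payload bits and each following byte carries 6.

```python
def compressed_bits(n_bytes: int) -> int:
    """Bits used by the compressed form of one n-byte UTF-8 sequence (2 <= n <= 6)."""
    if not 2 <= n_bytes <= 6:
        raise ValueError("only multi-byte sequences of 2 to 6 bytes are compressed")
    header = n_bytes                  # "11", "101", ..., "100001" are 2..6 bits long
    lead_payload = 7 - n_bytes        # payload bits of Byte 1 after its original header
    tail_payload = 6 * (n_bytes - 1)  # each later byte keeps 6 bits once "10" is dropped
    return header + lead_payload + tail_payload


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, "bytes:", 8 * n, "bits ->", compressed_bits(n), "bits")
```

Under these assumptions an n-byte sequence shrinks from 8n bits to 6n + 1 bits, so a 3-byte character such as a Hangul syllable goes from 24 bits to 19 bits.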
If the most significant bit of Byte 1 is 0, no compression operation is performed.
The compressed code likewise begins with a most significant bit of "1".
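The compression rules above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the names `HEADER_MAP` and `compress` are hypothetical, the input is assumed to be well-formed UTF-8, and the output is kept as a string of "0" and "1" characters so that no assumption about packing or padding the final byte is needed.

```python
# Original Byte 1 header -> compressed header, as listed in the description.
HEADER_MAP = {
    "110": "11",
    "1110": "101",
    "11110": "1001",
    "111110": "10001",
    "1111110": "100001",
}


def compress(data: bytes) -> str:
    """Compress UTF-8 bytes into a bit string (assumes well-formed input)."""
    out = []
    for byte in data:
        bits = format(byte, "08b")
        if bits[0] == "0":
            # Single-byte (ASCII) code: copied unchanged.
            out.append(bits)
        elif bits.startswith("10"):
            # Continuation byte: drop the leading "10", keep the 6 payload bits.
            out.append(bits[2:])
        else:
            # Lead byte: the header is the run of 1s plus the 0 that ends it.
            zero = bits.find("0")
            header = bits[: zero + 1]
            if zero == -1 or header not in HEADER_MAP:
                raise ValueError(f"not a valid UTF-8 lead byte: {bits}")
            out.append(HEADER_MAP[header] + bits[zero + 1:])
    return "".join(out)
```

For example, `compress("A".encode("utf-8"))` leaves the 8 bits of the ASCII letter untouched, while a 3-byte Hangul syllable yields a 19-bit string that starts with "101".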
The compressed code is decoded as follows.
[1] First, read the leading bit. If it is "0", read the following 7 bits and take the resulting 8 bits as a UTF-8 code without change.
[2] If it is "1", read on until the next "1" is encountered and split off those bits. The result is one of "11", "101", "1001", "10001", or "100001". Restoring each compressed header to the header that Byte 1 had before compression gives "110", "1110", "11110", "111110", or "1111110", respectively.
Once the header has been restored, the number of additional bits to read for Byte 1 is known. For example, if "110" was restored as the header of Byte 1, then according to Table 1 a further 5 bits are read to complete Byte 1. One more byte must then be restored, so a further 6 bits are read from the compressed data and "10" is prepended as the most significant bits to restore Byte 2.
If "1110" was restored as the header of Byte 1, then according to Table 1 a further 4 bits are read to restore Byte 1. Two more bytes must then be restored, so 6 bits are read from the compressed data twice, and "10" is prepended to each group of 6 bits to restore Byte 2 and Byte 3. The UTF-8 code is restored in this manner.
Repeating this process from the bit that follows the restored sequence through to the end of the compressed data fully decompresses the UTF-8 code.
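The decoding steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patent's implementation: the function name `decompress` is hypothetical, and the compressed data is assumed to be a string of "0" and "1" characters with no padding bits at the end.

```python
def decompress(bits: str) -> bytes:
    """Restore UTF-8 bytes from a compressed bit string (no trailing padding)."""
    out = bytearray()
    pos = 0
    while pos < len(bits):
        if bits[pos] == "0":
            # Step [1]: a leading 0 means the 8 bits are a UTF-8 byte as-is.
            if pos + 8 > len(bits):
                raise ValueError("truncated single-byte code")
            out.append(int(bits[pos:pos + 8], 2))
            pos += 8
            continue
        # Step [2]: the compressed header runs from this 1 up to the next 1.
        end = bits.find("1", pos + 1)
        n_bytes = end - pos + 1  # header "11" -> 2 bytes, "101" -> 3 bytes, ...
        if end == -1 or n_bytes > 6:
            raise ValueError("invalid compressed header")
        lead_len = 7 - n_bytes   # payload bits that complete Byte 1
        if end + 1 + lead_len + 6 * (n_bytes - 1) > len(bits):
            raise ValueError("truncated multi-byte code")
        pos = end + 1
        # Restore Byte 1: original header (n ones and a zero) plus its payload.
        out.append(int("1" * n_bytes + "0" + bits[pos:pos + lead_len], 2))
        pos += lead_len
        # Restore Byte 2 onward: prepend "10" to each group of 6 bits.
        for _ in range(n_bytes - 1):
            out.append(int("10" + bits[pos:pos + 6], 2))
            pos += 6
    return bytes(out)
```

For instance, `decompress("11" + "00011" + "101001")` restores the header "110", completes Byte 1 with 5 bits, and rebuilds Byte 2 from the last 6 bits, giving the two UTF-8 bytes of U+00E9.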
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160083901A KR20180004409A (en) | 2016-07-04 | 2016-07-04 | Universal real-time lossless data compression method of binary data encoded by utf-8 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020160083901A KR20180004409A (en) | 2016-07-04 | 2016-07-04 | Universal real-time lossless data compression method of binary data encoded by utf-8 |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20180004409A (en) | 2018-01-12 |
Family
ID=61000866
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020160083901A KR20180004409A (en) | 2016-07-04 | 2016-07-04 | Universal real-time lossless data compression method of binary data encoded by utf-8 |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20180004409A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113283215A (en) * | 2021-07-15 | 2021-08-20 | 北京华云安信息技术有限公司 | Data confusion method and device based on UTF-32 coding |
-
2016
- 2016-07-04 KR KR1020160083901A patent/KR20180004409A/en unknown
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101610609B1 (en) | Data encoder, data decoder and method | |
US8933826B2 (en) | Encoder apparatus, decoder apparatus and method | |
CN108737976A (en) | A kind of compression transmitting method based on Big Dipper short message | |
CN104467868A (en) | Chinese text compression method | |
CN103347047A (en) | Lossless data compression method based on online dictionaries | |
KR101667240B1 (en) | Secure and lossless data compression | |
KR20180004409A (en) | Universal real-time lossless data compression method of binary data encoded by utf-8 | |
KR20180004410A (en) | Universal real-time lossless data compression method of binary data encoded by utf-8 | |
KR101791877B1 (en) | Method and apparatus for compressing utf-8 code character | |
KR20180047738A (en) | UTF-8 character set compression method and apparatus thereof | |
KR101752281B1 (en) | Method and apparatus for compressing utf-8 code character | |
KR101791880B1 (en) | Method and apparatus for compressing utf-8 code character | |
RU2437148C1 (en) | Method to compress and to restore messages in systems of text information processing, transfer and storage | |
EP2113845A1 (en) | Character conversion method and apparatus | |
KR20180006011A (en) | HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof | |
KR20180006605A (en) | HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof | |
KR20180005386A (en) | Utf-8 code real-time lossless compression by using universal code method and appratus thereof | |
KR20180008034A (en) | HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof | |
KR20160049627A (en) | Enhancement of data compression rate by efficient mapping binary cluster with universal code based on frequency of binary cluster | |
KR20180008226A (en) | HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof | |
KR20180009060A (en) | HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof | |
KR20190091586A (en) | TCP/IP Packet data compression method and appratus based on binary compression method | |
KR20180007740A (en) | HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof | |
KR20180009088A (en) | HANGUL ON UTF-8 character set COMPRESSION Method and Appratus Thereof | |
KR101573983B1 (en) | Method of data compressing, method of data recovering, and the apparatuses thereof |