KR20180002145A

KR20180002145A - Utf-8 code real-time lossless compression by using universal code method and apparatus thereof

Info

Publication number: KR20180002145A
Application number: KR1020160081235A
Authority: KR
Inventors: 김정훈
Original assignee: 김정훈
Priority date: 2016-06-29
Filing date: 2016-06-29
Publication date: 2018-01-08

Abstract

The present invention is provided to achieve real-time lossless compression of UTF-8 code. According to the present invention, [1] 110, a header of a first byte, is replaced with 101 and a compression effect of 2 bits is obtained overall; [2] next, 1110, a header of the first byte, is replaced with 1001 and a compression effect of 4 bits is obtained overall; [3] next, 11110, a header of the first byte, is replaced with 1101, a compression effect of 1 bit is obtained from the header and 6 bits from the following bytes, for 7 bits overall; [4] next, 111110, a header of the first byte, is replaced with 10001, a compression effect of 1 bit is obtained from the header and 8 bits from the four following bytes, for 9 bits overall; and [5] next, 1111110, a header of the first byte, is replaced with 11001, a compression effect of 2 bits is obtained from the header and 10 bits from the five following bytes, for 12 bits overall.

Description

UTF-8 CODE REAL-TIME LOSSLESS COMPRESSION BY USING UNIVERSAL CODE METHOD AND APPARATUS THEREOF

UTF-8 code, character set, universal code, compression technology

Described in detail in the specific content for carrying out the invention.

UTF-8 is a worldwide standard code designed to cover both the ASCII code system and multilingual codes. In Korea it is the standard encoding method for representing Hangul, and it is so widely used that even Naver's search engine adopts it as its reference encoding. The following shows how frequently UTF-8 code is used on the web.

The present invention achieves real-time lossless compression by combining the encoding scheme of UTF-8 code with a universal code. The following figure describes the coding rules of UTF-8.

Code points U+0000 to U+007F begin with 0, and the following 7 bits are the actual code.

For the other code points, BYTE 1 begins with one of the patterns 110, 1110, 11110, 111110, or 1111110 depending on the range. A variable number of further bytes is then read (up to 6 bytes in total), and each of them must begin with 10.

If BYTE 1 begins with 110, the next 5 bits are read to complete BYTE 1, and then BYTE 2 is read in addition.

If BYTE 1 begins with 1110, the next 4 bits are read to complete BYTE 1, and then 2 more bytes are read. UTF-8 code is encoded according to the rules in the figure below.

After the bytes have been read in this way, the bits marked x are combined to decode the original data.
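The reading rules above can be sketched in Python. This is a minimal illustration that assumes the 1-to-6-byte UTF-8 layout described here; the function names `utf8_sequence_length` and `decode_code_point` are hypothetical, and only the header bits are validated.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the total number of bytes announced by BYTE 1."""
    if first_byte & 0x80 == 0:
        return 1                      # 0xxxxxxx: single-byte (ASCII) code
    ones = 0
    mask = 0x80
    while first_byte & mask:          # count the leading 1 bits of the header
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 6:
        raise ValueError("not a valid first byte")
    return ones                       # 110 -> 2, 1110 -> 3, ..., 1111110 -> 6


def decode_code_point(seq: bytes) -> int:
    """Combine the x bits of BYTE 1 and the following bytes into a code point."""
    n = utf8_sequence_length(seq[0])
    if n == 1:
        return seq[0]
    value = seq[0] & (0xFF >> (n + 1))        # x bits that follow the header
    for b in seq[1:n]:
        if b & 0xC0 != 0x80:                  # every following byte starts with 10
            raise ValueError("following byte does not start with 10")
        value = (value << 6) | (b & 0x3F)     # append its 6 x bits
    return value
```

For instance, the Hangul syllable "한" is encoded in three bytes whose first byte begins with 1110, so `utf8_sequence_length` reports 3 for that byte.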

In the present invention, a byte whose BYTE 1 begins with 0 does not undergo any separate compression. The main concern is BYTE 1 beginning with 1, whose header part is replaced with a universal code as follows.

[1] For the first byte (BYTE 1) corresponding to code points U+0080 to U+07FF,

(1) the header 110 is replaced with 101, and

(2) the first 2 bits of the second byte are removed, giving a 2-bit compression effect for U+0080 to U+07FF.

[2] Next, for the first byte corresponding to code points U+0800 to U+FFFF,

(1) the header 1110 is replaced with 1001, and

(2) the first 2 bits of each of the second and third bytes are removed, giving a 4-bit compression effect.

[3] Next, for the first byte corresponding to code points U+10000 to U+1FFFFF,

(1) the header 11110 is replaced with 1101, giving a 1-bit compression effect, and

(2) the first 2 bits, "10", of each of the second, third, and fourth bytes are removed, giving a 6-bit compression effect.

Overall, a 7-bit compression effect is obtained.

[4] Next, for the first byte corresponding to code points U+200000 to U+3FFFFFF,

(1) the header 111110 is replaced with 10001, giving a 1-bit compression effect, and

(2) the most significant 2 bits of each of the second through fifth bytes are removed, giving an 8-bit compression effect (4 bytes × 2 bits), for a 9-bit compression effect overall.

[5] Next, for the first byte corresponding to code points U+4000000 to U+7FFFFFFF,

(1) the header 1111110 is replaced with 11001, giving a 2-bit compression effect, and

(2) the most significant 2 bits of each of the second through sixth bytes are removed, giving a 10-bit compression effect (5 bytes × 2 bits), for a 12-bit compression effect overall.

That is, when the most significant bit of the first byte is 1, the UTF-8 headers are

110

1110

11110

111110

1111110

and the scheme could of course be extended by appending further 1s.

These headers are each converted into the universal codes below, so that the header part of the first byte is compressed. The number of subsequent bytes to read varies with the type of header in the first byte, and the most significant 2 bits of each of those subsequent bytes are removed.

101

1001

1101

10001

11001
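The header replacement and the removal of the leading "10" can be illustrated on bit templates. The sketch below is only an illustration: bytes are written as 8-character strings in which `x` stands for a payload bit, and the names `HEADER_TO_UNIVERSAL` and `compress_template` are hypothetical.

```python
# UTF-8 first-byte header -> universal code, in the order listed above.
HEADER_TO_UNIVERSAL = {
    "110": "101",
    "1110": "1001",
    "11110": "1101",
    "111110": "10001",
    "1111110": "11001",
}


def compress_template(utf8_bytes):
    """Compress one UTF-8 sequence given as strings such as "1110xxxx"."""
    first, rest = utf8_bytes[0], utf8_bytes[1:]
    if first.startswith("0"):
        return [first]                           # 0xxxxxxx is left untouched
    header = first[: first.index("0") + 1]       # leading 1s up to the first 0
    out = [HEADER_TO_UNIVERSAL[header] + first[len(header):]]
    for b in rest:
        if not b.startswith("10"):
            raise ValueError("following byte must start with 10")
        out.append(b[2:])                        # drop the leading "10"
    return out


print("/".join(compress_template(["1110xxxx", "10xxxxxx", "10xxxxxx"])))
# prints 1001xxxx/xxxxxx/xxxxxx
```

The 24-bit template for U+0800 to U+FFFF becomes a 20-bit one, which is the 4-bit saving described for case [2].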

The universal code proposed in the present invention has the following features.

It is a universal code that begins with "1" and ends with "01".

This code can be combined with another code (for example, a direct binary code as a suffix), in which case it is placed as the prefix.

Because it is a universal code beginning with "1", it is perfectly distinguished from a character code beginning with "0".

The unique decodability of this universal code is as follows.

For example, suppose the transmitting side sends

101 / 1001 / 11001 / 11101 / 101 / 1001

and the receiving side receives

101100111001111011011001

The receiving side splits the stream immediately after each first occurrence of "01", and thereby recovers the same universal codes:

101 / 1001 / 11001 / 11101 / 101 / 1001

As shown above, the universal codes that the transmitting side intended to send are separated out unchanged on the receiving side.
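The splitting rule can be checked with a few lines of Python. This is a sketch that assumes the received bits are held in a string of "0" and "1" characters and contain only universal codes; `split_universal_codes` is a hypothetical name.

```python
def split_universal_codes(bits: str) -> list:
    """Cut the stream immediately after each first occurrence of "01"."""
    codes = []
    start = 0
    while start < len(bits):
        end = bits.find("01", start)       # first "01" at or after start
        if end == -1:
            raise ValueError("trailing bits do not form a complete code")
        codes.append(bits[start:end + 2])  # the code ends with that "01"
        start = end + 2
    return codes


print(split_universal_codes("101100111001111011011001"))
# prints ['101', '1001', '11001', '11101', '101', '1001']
```

The rule works because every code consists of a run of "1"s, then a run of "0"s, then a final "1", so the only "01" inside a code is its last two bits.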

These universal codes are as follows.

Sequence number M    Universal code
1                    101
2                    1001
3                    1101
4                    10001
5                    11001
6                    11101
7                    100001
8                    110001
9                    111001
10                   111101
11                   1000001
12                   1100001
13                   1110001
14                   1111001
15                   1111101
16                   10000001
17                   11000001
18                   11100001
19                   11110001
...                  ...

As with 101, 1001, 10001, and 100001, each group begins with its group-K seed code, whose most and least significant bits are "1" with K "0"s between them. As the sequence number increases, the universal codes are developed by modifying this seed code.

For example, in group k, as the sequence number increases, the "0"s are changed to "1" one at a time, starting from the bit after the most significant bit. Once only the second-least-significant bit remains "0", the seed code of group k+1 becomes the next universal code.

That is, taking group 3 as an example,

10001 is the group-3 seed code, the next code is 11001, and the next is 11101. At that point every bit except the second-least-significant one is "1", so the next code is the group-4 seed code, 100001.
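The seed-code rule lends itself to a short generator. The following is a sketch under the assumption that codes are represented as Python strings; `universal_codes` is a hypothetical name.

```python
from itertools import islice


def universal_codes():
    """Yield the universal codes in sequence-number order (M = 1, 2, 3, ...)."""
    k = 1
    while True:
        code = "1" + "0" * k + "1"             # group-k seed code
        yield code
        while code.count("0") > 1:             # stop when a single "0" remains
            code = code.replace("0", "1", 1)   # fill the leftmost "0" with "1"
            yield code
        k += 1                                 # continue with the next group's seed


print(list(islice(universal_codes(), 7)))
# prints ['101', '1001', '1101', '10001', '11001', '11101', '100001']
```

Group k therefore contributes exactly k codes, each k + 2 bits long.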

The conversion table for the standard UTF-8 first-byte headers beginning with 1 is as follows.

Table 3
UTF-8 header    Universal code for compression
110             101
1110            1001
11110           1101
111110          10001
1111110         11001
...             ...

Meanwhile, the figure below shows the region of the UTF-8 code system in which Hangul is located. The header of its first byte is 1110. Since Hangul is used very frequently in Korea, this header need not be mapped sequentially to the universal code 1001 as in Table 3 above. Instead, as in Table 4 below, it can be mapped to 101, which is 1 bit shorter, while the header 110 is mapped to 1001. By agreeing on such a mapping as the compression and decompression protocol, according to the writing system or the frequency of occurrence, an even higher compression ratio can of course be achieved.

Table 4
UTF-8 header    Universal code for compression
110             1001
1110            101
11110           1101
111110          10001
1111110         11001
...             ...
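The effect of swapping the two shortest codes can be estimated by counting compressed bits under each table. The sketch below makes several assumptions: tables are keyed by the UTF-8 sequence length in bytes rather than by the header string, only the 2-, 3-, and 4-byte rows are included because Python's UTF-8 encoder emits at most 4 bytes, and the names `TABLE_3`, `TABLE_4`, and `compressed_bits` are hypothetical.

```python
# Universal code assigned to each UTF-8 sequence length (2 -> header 110, etc.).
TABLE_3 = {2: "101", 3: "1001", 4: "1101"}   # sequential assignment
TABLE_4 = {2: "1001", 3: "101", 4: "1101"}   # shortest code given to header 1110


def compressed_bits(text: str, table: dict) -> int:
    """Count the bits needed for text after compression with the given table."""
    total = 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if n == 1:
            total += 8                       # 0xxxxxxx bytes are not compressed
            continue
        payload_in_first_byte = 7 - n        # 8 bits minus the (n + 1)-bit header
        total += len(table[n]) + payload_in_first_byte + 6 * (n - 1)
    return total


print(compressed_bits("한글", TABLE_3), compressed_bits("한글", TABLE_4))
# prints 40 38
```

Each Hangul syllable costs 20 bits under Table 3 and 19 bits under Table 4, while each 2-byte character becomes 1 bit more expensive, so the swap pays off when 3-byte characters dominate the text.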

The method of decompressing UTF-8 code compressed in the above manner is now described.

First, in UTF-8 code data compressed in this way,

[1] the most significant bit is what matters. If it is "0", the data is uncompressed: the following 7 bits are read as they are and used directly as the original UTF-8 code corresponding to code points U+0000 to U+007F.

[2] If the most significant bit is "1", the universal code part is decoded first by splitting immediately after the first "01" encountered, according to the decoding rule of the universal code described above.

The following example is the compressed result of a UTF-8 code in U+0800 to U+FFFF:

1001xxxx/xxxxxx/xxxxxx (/ is a virtual delimiter)

Splitting off the universal code gives:

1001 xxxx/xxxxxx/xxxxxx (/ is a virtual delimiter)

Thus, once 1001 has been split off, Table 3 shows that "1110" is the header of the original first byte, so according to the UTF-8 encoding rules the next 4 bits are read to restore the first byte. The rule is that 2 further bytes follow. Because the header of the first byte is 1110 and each of these bytes was compressed by 2 bits, 6 bits are read twice, and the 2 bits "10" are prepended to each group of 6 bits to restore 8 bits each, decompressing to the original UTF-8 code. This can be diagrammed as below.

Decoding then returns to the start: depending on whether the most significant bit of the bits following those already read and decoded is "0" or "1", the above process is repeated until the data is decompressed into the original UTF-8 code.
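The compression and decompression procedures can be exercised together in a round trip. This is a sketch under stated assumptions, not the patented apparatus: the bit stream is modelled as a Python string of "0" and "1" characters, padding the stream to a byte boundary is not handled, only the 2- to 4-byte rows of Table 3 are used because Python's UTF-8 encoder emits at most 4 bytes, and the names `compress` and `decompress` are hypothetical.

```python
HEADER_TO_CODE = {"110": "101", "1110": "1001", "11110": "1101"}
CODE_TO_HEADER = {code: header for header, code in HEADER_TO_CODE.items()}


def compress(text: str) -> str:
    out = []
    for ch in text:
        raw = ch.encode("utf-8")
        first = format(raw[0], "08b")
        if first[0] == "0":
            out.append(first)                    # 0xxxxxxx: copied unchanged
            continue
        header = "1" * len(raw) + "0"            # 110, 1110 or 11110
        out.append(HEADER_TO_CODE[header] + first[len(header):])
        for b in raw[1:]:
            out.append(format(b, "08b")[2:])     # drop the leading "10"
    return "".join(out)


def decompress(bits: str) -> str:
    raw = bytearray()
    pos = 0
    while pos < len(bits):
        if bits[pos] == "0":                     # uncompressed 8-bit code
            raw.append(int(bits[pos:pos + 8], 2))
            pos += 8
            continue
        end = bits.index("01", pos) + 2          # split after the first "01"
        header = CODE_TO_HEADER[bits[pos:end]]
        payload = 8 - len(header)                # x bits left in the first byte
        raw.append(int(header + bits[end:end + payload], 2))
        pos = end + payload
        for _ in range(len(header) - 2):         # number of following bytes
            raw.append(int("10" + bits[pos:pos + 6], 2))
            pos += 6
    return raw.decode("utf-8")


sample = "UTF-8 한글"
assert decompress(compress(sample)) == sample
print(len(sample.encode("utf-8")) * 8, len(compress(sample)))
# prints 96 88
```

The two Hangul syllables save 4 bits each, and the six ASCII characters are carried through unchanged.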

Meanwhile, the universal code of the present invention can be generalized as follows.

That is, starting from sequence number 1, the code can be computed directly from the sequence number value.

Table 5
Sequence number M    Universal code
1                    101
2                    1001
3                    1101
4                    10001
5                    11001
6                    11101
7                    100001
8                    110001
9                    111001
10                   111101
11                   1000001
12                   1100001
13                   1110001
14                   1111001
15                   1111101
16                   10000001
17                   11000001
18                   11100001
19                   11110001
20                   11111001
21                   11111101
22                   100000001
23                   110000001
24                   111000001
25                   111100001
26                   111110001
27                   111111001
28                   111111101
29                   1000000001
30                   1100000001
31                   1110000001
32                   1111000001
33                   1111100001
34                   1111110001
35                   1111111001
36                   1111111101
37                   10000000001
38                   11000000001
39                   11100000001
40                   11110000001
41                   11111000001
42                   11111100001
43                   11111110001
44                   11111111001
45                   11111111101
46                   100000000001
47                   110000000001
48                   111000000001
49                   111100000001
50                   111110000001
51                   111111000001
52                   111111100001
53                   111111110001
54                   111111111001
55                   111111111101
56                   1000000000001
...                  ...

To determine which universal code a given integer sequence number M maps to, the values K and X must be found, through the following process.

i) First, K' is calculated from M.

ii) If K' is a (positive) integer, then K = K' - 1;

if K' is not a (positive) integer, then K = f(K')

(where f(x) is the function that discards the fractional part of x, for x >= 0).

For reference, one way to determine whether K' is an integer is:

if f(K') = K', then K' is an integer;

if f(K') ≠ K', then K' is not an integer.

Alternatively, steps i) and ii) can be expressed as a single formula, as follows.

iii) With K obtained in this way and M given, X is obtained using the following formula.
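The formulas for steps i) to iii) can be inferred from the table: group K holds K codes, so it covers sequence numbers K(K-1)/2 + 1 through K(K+1)/2. The following is one formulation that agrees with rule ii) and with every listed code. It is an assumption derived from the table, not a quotation of the original equations.

```latex
% i) intermediate value K' (an integer exactly when M is the last member of its group)
K' = \frac{1 + \sqrt{1 + 8M}}{2}

% ii) group number K, also written as a single expression
K =
\begin{cases}
  K' - 1                 & \text{if } K' \text{ is an integer} \\
  \lfloor K' \rfloor     & \text{otherwise}
\end{cases}
\;=\; \lceil K' \rceil - 1

% iii) number X of consecutive 1s that follow the most significant 1
X = M - \frac{K(K-1)}{2} - 1
```

As a check, M = 5 gives K' ≈ 3.70, hence K = 3 and X = 5 - 3 - 1 = 1, which corresponds to the code 11001 listed for sequence number 5. M = 3 gives K' = 3 exactly, hence K = 2 and X = 1, which corresponds to 1101.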

As a result, the values K and X can be calculated from the sequence number M by the above equations, and the universal code with sequence number M is:

Universal code = the most significant "1", followed by X consecutive "1"s, then K-X consecutive "0"s, and finally the least significant "1".
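The construction from K and X can be sketched in Python. The rule "1", X ones, K-X zeros, "1" is taken from the text, whereas the way K and X are computed from M below, K' = (1 + sqrt(1 + 8M)) / 2 and X = M - K(K-1)/2 - 1, is an assumption chosen to agree with the listed codes. The name `universal_code` is hypothetical.

```python
import math


def universal_code(m: int) -> str:
    """Return the universal code for sequence number m (m >= 1)."""
    if m < 1:
        raise ValueError("sequence numbers start at 1")
    root = math.isqrt(1 + 8 * m)
    # Assumed K' = (1 + sqrt(1 + 8m)) / 2; it is an integer exactly when
    # 1 + 8m is a perfect square (its square root is then odd).
    if root * root == 1 + 8 * m:
        k = (1 + root) // 2 - 1      # K' is an integer: K = K' - 1
    else:
        k = (1 + root) // 2          # otherwise: K = K' with its fraction discarded
    x = m - k * (k - 1) // 2 - 1     # assumed formula for X
    return "1" + "1" * x + "0" * (k - x) + "1"


print([universal_code(m) for m in (1, 2, 3, 7, 19)])
# prints ['101', '1001', '1101', '100001', '11110001']
```

Integer arithmetic with `math.isqrt` is used instead of floating-point square roots so that the test for an integer K' stays exact for large M.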

A universal code generated by this formula is decoded as follows.

K is the total length of the universal code minus 2, and

T is the number of "1"s counted from the second most significant bit of the universal code toward the least significant bit until the first "0" is encountered. It is important that counting starts at the second most significant bit.

The sequence number M is then as follows.
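Since group K is preceded by K(K-1)/2 codes and T counts how far the code has advanced within its group, one formula consistent with the definitions of K and T is the following. It is a reconstruction offered as an assumption, not a quotation of the original equation.

```latex
M = \frac{K(K-1)}{2} + T + 1
```

For the seed code 101, K = 1 and T = 0, which gives M = 1.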

For example,

for the universal code "1101", K = 2 and T = 1, so

substituting into the above formula gives M = 3, which is exactly the sequence number listed in Table 5.

For the universal code "100001", K = 4 and T = 0, so

M = 7, confirming that the sequence number is derived exactly.
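The two worked examples can be reproduced in Python. The definitions of K and T follow the text. The closing formula M = K(K-1)/2 + T + 1 is an assumption that matches both examples, and `sequence_number` is a hypothetical name.

```python
def sequence_number(code: str) -> int:
    """Recover the sequence number M from a universal code given as a bit string."""
    k = len(code) - 2                  # K: total length minus 2
    t = 0
    for bit in code[1:]:               # start at the second most significant bit
        if bit == "0":
            break
        t += 1                         # T: number of "1"s before the first "0"
    if k < 1 or code != "1" + "1" * t + "0" * (k - t) + "1":
        raise ValueError("not a well-formed universal code")
    return k * (k - 1) // 2 + t + 1    # assumed formula for M


print(sequence_number("1101"), sequence_number("100001"))
# prints 3 7
```

The shape check rejects strings such as "1011" that do not have the form of one "1", a run of "1"s, a run of "0"s, and a final "1".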

Claims

Domestic priorities, and PCT application claims.