KR20210042439A

KR20210042439A - Binary data compression method and apparatus thereof

Info

Publication number: KR20210042439A
Application number: KR1020190124967A
Authority: KR
Inventors: 김정훈
Original assignee: 김정훈
Priority date: 2019-10-10
Filing date: 2019-10-10
Publication date: 2021-04-20

Abstract

According to the present invention, the 1-bit gain obtained from a compressed binary cluster can be used as is by forming rows of the form [compressed binary cluster] + [data that starts with 1 (either after suitable modification or as is) and whose remaining length can be determined by analyzing its header bits] + [other data bits that fill the row to a fixed bit width].

Description

Binary data compression method and apparatus thereof {BINARY DATA COMPRESSION METHOD AND APPARATUS THEREOF}

Detailed description for carrying out the invention

For example, consider arbitrary binary data such as:

001110111000001111100111000101111....

Splitting the data each time "01" is encountered yields pieces called binary clusters:

001/1101/11000001/1111001/110001/01/111

The data at the end that never reaches a "01" is called surplus data.

Removing the lowest "1" from each binary cluster

yields binary data that always ends in 0 and is 1 bit shorter. This is called a compressed binary cluster.

00/110/1100000/111100/11000/0 // 111 (surplus data)
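
The splitting and 1-bit removal described above can be sketched in Python as follows. The function names `split_clusters` and `compress_cluster` are illustrative and do not appear in the specification; bits are modelled as a string of "0" and "1" characters for readability.

```python
def split_clusters(bits: str) -> tuple:
    """Split a bit string after every "01"; return (clusters, surplus)."""
    clusters = []
    start = 0
    while True:
        idx = bits.find("01", start)
        if idx == -1:
            break
        clusters.append(bits[start:idx + 2])  # include the "01" itself
        start = idx + 2
    return clusters, bits[start:]             # the tail without "01" is surplus


def compress_cluster(cluster: str) -> str:
    """Drop the final "1"; what remains always ends in "0"."""
    return cluster[:-1]


data = "001110111000001111100111000101111"
clusters, surplus = split_clusters(data)
compressed = [compress_cluster(c) for c in clusters]
print("/".join(clusters), "//", surplus)
print("/".join(compressed), "//", surplus)
```

Run on the example data, this reproduces the cluster list and the compressed cluster list shown above, with 111 left over as surplus data.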

Considering only the compressed binary clusters and stacking them vertically, one per row, gives the following.

00

110

1100000

111100

11000

0

With the compressed binary clusters stacked in this way, each row is joined with a binary data string that starts with 1 and whose remaining length can be determined from its header bits or similar. As one example, the binary data string for a UTF-8 character is:

0xxxxxxx, 8 bits in total, if it starts with 0;

110xxxxx 10xxxxxx, 16 bits, if it starts with 110;

1110xxxx 10xxxxxx 10xxxxxx, 24 bits, if it starts with 1110;

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, 32 bits, if it starts with 11110.

Analyzing the header bits therefore reveals how many of the following bytes belong to the same unit of data.

In this case, prefixing a 1 to any string that starts with 0, giving the 9-bit form 10xxxxxxx, keeps all cases distinguishable while making every string start with 1, so that it can be joined to the compressed binary clusters as follows.

00<UTF-8 bit string>

110<UTF-8 bit string>

1100000<UTF-8 bit string>

111100<UTF-8 bit string>

11000<UTF-8 bit string>

0<UTF-8 bit string>
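
As a minimal sketch of the UTF-8 example, the following Python code builds a bit string that is forced to start with 1 and recovers its length from the header bits alone. The helper names `utf8_payload` and `payload_length` are assumptions made for illustration; only the header patterns and the 9-bit form 10xxxxxxx come from the text.

```python
def utf8_payload(ch: str) -> str:
    """Bit string of one UTF-8 character, forced to start with 1."""
    bits = "".join(f"{b:08b}" for b in ch.encode("utf-8"))
    # Single-byte characters start with 0, so prefix a 1 (9-bit form 10xxxxxxx).
    return bits if bits[0] == "1" else "1" + bits


def payload_length(bits: str) -> int:
    """Total length in bits of the payload that starts at bits[0]."""
    if bits.startswith("10"):
        return 9       # prefixed single-byte character
    if bits.startswith("110"):
        return 16
    if bits.startswith("1110"):
        return 24
    if bits.startswith("11110"):
        return 32
    raise ValueError("not a recognised payload header")


# One sample character per UTF-8 length class (1 to 4 bytes).
for ch in ["A", "\u00e9", "\ud55c", "\U0001F600"]:
    p = utf8_payload(ch)
    assert payload_length(p) == len(p)
    print(ascii(ch), len(p), p)
```

Because a UTF-8 lead byte never begins with 10, the 9-bit form cannot be confused with the multi-byte forms.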

Next, other data bits are fetched sequentially, only as many as each row needs, and appended so that every row has the same number of bits. Every row then has the same bit width, while a bit gain of 0 to 1 bit arises in each row. Once the rows are fixed-width bit strings, they can be concatenated into a single row and transmitted, and the receiving side can split the stream into fixed-width bit strings and decompress each one.

00<UTF-8 bit string><....other data bits>

110<UTF-8 bit string><....other data bits>

1100000<UTF-8 bit string><other data bits>

111100<UTF-8 bit string><.other data bits>

11000<UTF-8 bit string><..other data bits>

0<UTF-8 bit string><.....other data bits>
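
The row layout above can be sketched as follows. This is an illustration under assumptions: the function name `build_rows`, the row width of 16 bits, the repeated 9-bit payload, and the filler bit pattern are all chosen for the example and are not specified in the text.

```python
def build_rows(compressed_clusters, payloads, other_bits, width):
    """Return (rows, unused_other_bits); each row is exactly `width` bits."""
    rows, pos = [], 0
    for cluster, payload in zip(compressed_clusters, payloads):
        need = width - len(cluster) - len(payload)
        if need < 0:
            raise ValueError("width too small for this row")
        filler = other_bits[pos:pos + need]   # take only as many bits as needed
        if len(filler) < need:
            raise ValueError("not enough other data bits")
        pos += need
        rows.append(cluster + payload + filler)
    return rows, other_bits[pos:]


clusters = ["00", "110", "1100000", "111100", "11000", "0"]
payloads = ["101000001"] * len(clusters)   # 9-bit form of "A", repeated
other = "0110" * 5                          # placeholder filler bits
rows, leftover = build_rows(clusters, payloads, other, width=16)

stream = "".join(rows)                      # one concatenated row to transmit
# The receiving side cuts the stream back into fixed-width rows.
received = [stream[i:i + 16] for i in range(0, len(stream), 16)]
for r in rows:
    print(r)
```

With a width of 16, the longest compressed cluster (7 bits) plus the 9-bit payload fills its row exactly, and shorter clusters take proportionally more filler bits.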

The present invention thus forms

[compressed binary cluster] + [data that starts with 1 (either after suitable modification or as is) and whose remaining length can be determined by analyzing its header bits] + [other data bits that fill the row to a fixed bit width], so that the 1-bit gain from the compressed binary cluster can be used as is.

In particular, the middle component, [data that starts with 1 (either after suitable modification or as is) and whose remaining length can be determined by analyzing its header bits], may be the output of an existing compression algorithm or any other data with the same property.

As another embodiment of the middle data, consider deflate. Deflate is a very popular compression algorithm and an international standard.

Internally, deflate has a multi-block structure, and the blocks can be separated from one another. With reference to RFC 1951, which defines deflate, a simple modification transforms each block into a form in which the header bits appear in the first byte and an end-of-block code appears at the end. (The end-of-block code is a Huffman code, so even when arbitrary bits follow it, it is uniquely identified as the data is decoded in sequence.) Because deflate packs data starting from the least significant bit (LSB) of each byte, the lowest bit of the first byte indicates whether the block is the last block (=1) or an intermediate block (=0), and the following 2 bits (10, 11, 01, 00) indicate the compression method or a reserved value.

The data is then modified slightly, as follows. The lowest bit, which indicates whether the block is the last one, is moved to the highest position. Most blocks are intermediate blocks, for which this bit is 0, so it is set to 1; for the last block it remains 1. Other variants are also possible, such as moving other bits along with it or reversing the bit order. What matters is that the first header bit of an intermediate block is changed from 0 to 1, that the header bit marking the last block remains 1, and that this bit comes first. The last block can be identified either by including the total number of blocks in the compressed data, or by checking the amount of data remaining during decoding: if only one fixed-width bit string remains, it is naturally the last block.
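
A minimal sketch of this header modification, applied to the first byte of a raw deflate stream, might look like the following. The function names are hypothetical, and the sketch covers only the simplest variant described above (the last-block bit moved to the top position and forced to 1); the real value of that bit has to be recovered separately, for instance from a block count.

```python
import zlib


def move_final_bit_to_front(first_byte: int) -> int:
    """Move the lowest bit (last-block flag) to the top position and force it to 1."""
    return 0x80 | (first_byte >> 1)


def restore_first_byte(modified: int, is_last_block: bool) -> int:
    """Undo the move; the true flag value must be known out of band."""
    return ((modified & 0x7F) << 1) | int(is_last_block)


# Raw deflate data (no zlib or gzip wrapper) so the block header is in byte 0.
compressor = zlib.compressobj(wbits=-15)
raw = compressor.compress(b"hello hello hello") + compressor.flush()

first = raw[0]
is_last = bool(first & 1)                 # lowest bit of the first byte
modified = move_final_bit_to_front(first)

print(f"{first:08b} -> {modified:08b}")
assert modified & 0x80                    # the modified byte always starts with 1
assert restore_first_byte(modified, is_last) == first
```

The round trip shows that no information other than the last-block flag is lost, which is why the text requires a separate way to identify the last block.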

Next, the binary clusters are converted into compressed binary clusters according to the rule described above.

Each compressed binary cluster is joined with a deflate data block. Then, as above, other data bits are fetched sequentially in the amount needed to form a fixed-width bit string and appended. The joined rows are concatenated into a single stream and transmitted, and the receiving side cuts off one fixed-width bit string at a time and decompresses it.

As shown in the figure below, the point where a 1 is first encountered after a 0 has been seen is the start of the header bits, and everything before it is the compressed binary cluster. Taking 8 bits from this header position, changing the most significant bit from 1 to 0 (for an intermediate block), and moving it back to the least significant position restores the start of the original deflate header bits. Applying the deflate decoding rules then naturally delimits the compressed data block up to the end-of-block code, and decompressing it restores the original. The other data bits are likewise restored by concatenating them in order. Each compressed binary cluster is restored to the original binary cluster by appending a 1 after its lowest bit.
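
The boundary rule described here (the first 1 after a 0 marks the start of the header bits) can be sketched as below. The names `split_row` and `restore_cluster` are illustrative, and the sketch stops at locating the boundary; decoding the payload and the other data bits is left out.

```python
def split_row(row: str) -> tuple:
    """Return (compressed_cluster, rest) for one fixed-width row."""
    first_zero = row.index("0")                 # a compressed cluster always contains a 0
    header_start = row.index("1", first_zero)   # first 1 after that 0
    return row[:header_start], row[header_start:]


def restore_cluster(compressed: str) -> str:
    """Append the removed 1 to recover the original binary cluster."""
    return compressed + "1"


row = "1100000" + "101000001"    # compressed cluster followed by a 9-bit payload
cluster, rest = split_row(row)
print(cluster, rest, restore_cluster(cluster))
```

The leading 1 of the payload doubles as the terminator of the cluster, which is why the cluster's own final 1 does not need to be stored.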

As described above, the last block can be identified either by including the total number of blocks in the compressed data, or by checking the amount of data remaining during decoding: if only one fixed-width bit string remains, it is naturally the last block.

Claims

This is a prior application filed for the purpose of claiming domestic priority, so no separate scope of claims is stated.