KR101612281B1

KR101612281B1 - Binary data compression and restoration method and apparatus

Info

Publication number: KR101612281B1
Application number: KR1020150038197A
Authority: KR
Inventors: 김정훈
Original assignee: 김정훈
Priority date: 2015-03-19
Filing date: 2015-03-19
Publication date: 2016-04-14

Abstract

The present invention relates to a method and an apparatus for compressing and restoring binary data that compress and restore the data rapidly and efficiently through simple operations and a simple hardware configuration. A binary data compression method includes the steps of: scanning, by a compression unit, raw binary data in a first direction in n-bit units and obtaining cyclic blocks made up of n-bit symbols, wherein each cyclic block is a binary block containing the symbols encountered, when the raw binary data is scanned sequentially in the first direction, until all 2^n different kinds of symbols have appeared at least once; obtaining, by the compression unit, cyclic groups that each include one or more cyclic blocks, based on the dispersion of the frequencies with which the symbols appear within one cyclic block or a plurality of neighboring cyclic blocks; compressing, by the compression unit, each cyclic group by Huffman coding the symbols in that cyclic group, thereby generating compressed cyclic groups; and generating, by the compression unit, compressed data by combining the compressed cyclic groups.

Description

The present invention relates to a method and apparatus for compressing and restoring binary data, and more specifically to a method and apparatus that can compress and restore binary data effectively and efficiently through simple operations and a simple hardware configuration, and that can also improve data transmission speed and efficiency.

In general, the frequency bandwidth available on an ordinary transmission channel is limited, so various transmission systems such as modems have used effective data compression techniques that compress or reduce the amount of data to be transmitted in order to send large amounts of data.

One such compression technique is CCITT V.42 bis, a coding algorithm standardized by the International Telecommunication Union (ITU) and adopted in data transmission systems such as modems. The basis of this coding standard is the Ziv-Lempel code (ZLC). This method adaptively builds a dictionary from the input data and transmits, as the codeword, the dictionary address at which a phrase identical to earlier input data is stored. The dictionary is updated by continuously string-matching against the input data and adding to the dictionary the longest matching string combined with the first unmatched character.

However, such conventional compression methods involve complex processing operations for compressing and restoring data, require relatively high-specification hardware, limit how far the processing speed can be improved, and make it difficult to increase the reliability of the compression result.

The background art of the present invention is disclosed in Korean Patent Laid-Open Publication No. 1999-0022960 (published March 25, 1999).

The technical problem to be solved by the present invention is to provide a method and apparatus for compressing and restoring binary data that can compress and restore binary data quickly and efficiently through simple operations and a simple hardware configuration, that achieve an excellent compression ratio, that improve the reliability of the compressed and restored data, and that also improve transmission efficiency and speed when the data is transmitted. In particular, the invention aims to raise the compression ratio of Huffman coding more effectively.

According to one aspect of the present invention, there is provided a binary data compression method for a compression apparatus, the method comprising: a cyclic block acquisition step in which a compression unit scans original binary data in a first direction in units of n bits to obtain a plurality of cyclic blocks each having a plurality of n-bit symbols, wherein each cyclic block is a binary block containing the symbols encountered, when the original binary data is scanned sequentially in the first direction, until all 2^n different kinds of symbols have appeared at least once; a step in which the compression unit obtains a plurality of cyclic groups, each including at least one cyclic block, based on the dispersion of the frequencies with which the symbols appear within one cyclic block or a plurality of neighboring cyclic blocks; a step in which the compression unit performs Huffman coding on the symbols in each cyclic group, thereby compressing each cyclic group and generating a plurality of compressed cyclic groups; and a step in which the compression unit combines the plurality of compressed cyclic groups to generate compressed data.
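The cyclic block definition can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the first direction is taken to be front to back, a block is closed at the symbol that completes the set of 2^n kinds, and the function names `to_symbols` and `split_cyclic_blocks` are hypothetical.

```python
def to_symbols(data: bytes, n: int) -> list[int]:
    """Split data into n-bit symbols, scanning from the first bit onward."""
    bits = "".join(f"{byte:08b}" for byte in data)
    usable = len(bits) - len(bits) % n  # assumption: drop a trailing partial symbol
    return [int(bits[i:i + n], 2) for i in range(0, usable, n)]


def split_cyclic_blocks(symbols: list[int], n: int):
    """Return (blocks, leftover).

    Each block ends at the symbol that makes all 2**n kinds appear at least
    once. The leftover symbols never complete the set.
    """
    blocks, current, seen = [], [], set()
    for s in symbols:
        current.append(s)
        seen.add(s)
        if len(seen) == 2 ** n:  # every kind of symbol has now appeared
            blocks.append(current)
            current, seen = [], set()
    return blocks, current


if __name__ == "__main__":
    blocks, leftover = split_cyclic_blocks([0, 1, 1, 2, 3, 3, 2, 1, 0, 0, 1], n=2)
    print(blocks, leftover)
```

With n = 2, the sample sequence splits into the blocks `[0, 1, 1, 2, 3]` and `[3, 2, 1, 0]`, and the trailing `[0, 1]` is left over because it never sees all four symbols.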

In the present invention, the step of obtaining the cyclic groups comprises: computing a first dispersion of the frequencies with which the symbols appear in the current cyclic block; comparing the first dispersion with a reference dispersion; and, if the first dispersion is equal to or greater than the reference dispersion, obtaining the current cyclic block as a cyclic group.

In the present invention, the method further comprises, if the first dispersion is less than the reference dispersion, computing a second dispersion of the frequencies with which the symbols appear in the current cyclic block and at least one following cyclic block and, if the second dispersion is equal to or greater than the reference dispersion, obtaining the current cyclic block and the at least one following cyclic block as a cyclic group, wherein the number of following cyclic blocks is the minimum number that makes the second dispersion equal to or greater than the reference dispersion.
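The grouping rule can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the population variance of the per-symbol counts as the dispersion, and the names `dispersion`, `group_blocks`, and `reference` are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def dispersion(symbols: list[int], n: int) -> float:
    """Population variance of the 2**n per-symbol occurrence counts."""
    counts = Counter(symbols)
    return pvariance([counts.get(s, 0) for s in range(2 ** n)])


def group_blocks(blocks: list[list[int]], n: int, reference: float):
    """Return (groups, pending).

    A block that reaches the reference dispersion by itself forms a group.
    Otherwise following blocks are appended one at a time, and the group is
    closed as soon as the combined dispersion reaches the reference, so the
    number of appended blocks is the minimum.
    """
    groups, pending = [], []
    for block in blocks:
        pending.append(block)
        merged = [s for b in pending for s in b]
        if dispersion(merged, n) >= reference:
            groups.append(pending)
            pending = []
    return groups, pending  # pending: blocks that never reached the reference


if __name__ == "__main__":
    blocks = [[0, 1], [0, 0, 0, 1], [1, 1, 1, 1, 0], [0, 1]]
    print(group_blocks(blocks, n=1, reference=1.0))
```

In the sample run with n = 1, the first block has counts (1, 1) and variance 0, so it is merged with the second block; the third block reaches the reference on its own; the last block stays pending.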

In the present invention, the dispersion may be the variance, standard deviation, skewness, or kurtosis of the frequencies, or an index of the minimum number of symbol kinds that account for at least a reference ratio of the total frequency.
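The candidate measures can be computed from a list of per-symbol counts as in the sketch below. The formulas are the usual population moments (kurtosis is not the excess form), which is an assumption, since the exact definitions are not given here; `dispersion_measures` and `symbols_needed` are hypothetical names.

```python
import math


def dispersion_measures(freqs: list[int]) -> dict[str, float]:
    """Population variance, standard deviation, skewness and kurtosis."""
    n = len(freqs)
    mean = sum(freqs) / n
    var = sum((f - mean) ** 2 for f in freqs) / n
    std = math.sqrt(var)
    if std == 0:
        skew = kurt = 0.0  # convention chosen here for a perfectly flat histogram
    else:
        skew = sum((f - mean) ** 3 for f in freqs) / n / std ** 3
        kurt = sum((f - mean) ** 4 for f in freqs) / n / std ** 4
    return {"variance": var, "std": std, "skewness": skew, "kurtosis": kurt}


def symbols_needed(freqs: list[int], ratio: float) -> int:
    """Minimum number of symbol kinds whose counts reach ratio * total."""
    total = sum(freqs)
    running = 0
    for k, f in enumerate(sorted(freqs, reverse=True), start=1):
        running += f
        if running >= ratio * total:
            return k
    return len(freqs)


if __name__ == "__main__":
    counts = [8, 4, 2, 2]
    print(dispersion_measures(counts))
    print(symbols_needed(counts, 0.75))
```

For the sample counts `[8, 4, 2, 2]`, the two most frequent symbols already cover 75% of all occurrences. Note that for this last index a smaller value means a more concentrated distribution, so how it is compared with the reference is left open in this sketch.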

In the present invention, the method further comprises, for the binary digits of a residual group left over after the plurality of cyclic groups have been obtained, a step in which the compression unit performs Huffman coding based on the frequencies with which the symbols appear in the residual group to obtain a compressed residual group.

In the present invention, the step of generating the plurality of compressed cyclic groups comprises: generating a Huffman code for each symbol according to the frequency of each symbol in each cyclic group; and replacing each symbol in each cyclic group with the corresponding Huffman code.

In the present invention, the step of generating the plurality of compressed cyclic groups may further comprise generating a coding dictionary that includes a Huffman code string, produced by concatenating the Huffman codes corresponding to the respective symbols in a single row, and the bit length of each Huffman code within the Huffman code string.
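A coding dictionary of this form can be sketched as a pair of functions. The sketch assumes that the codes are stored in ascending symbol order and represents bits as Python strings for readability; `pack_dictionary` and `unpack_dictionary` are hypothetical names, and the actual bit-level layout is not specified here.

```python
def pack_dictionary(codes: list[str]) -> tuple[str, list[int]]:
    """codes[s] is the Huffman code of symbol s, given as a bit string.

    Returns the concatenated code string and the bit length of each code.
    """
    return "".join(codes), [len(c) for c in codes]


def unpack_dictionary(code_string: str, lengths: list[int]) -> list[str]:
    """Cut the code string back into per-symbol codes using the bit lengths."""
    codes, pos = [], 0
    for length in lengths:
        codes.append(code_string[pos:pos + length])
        pos += length
    return codes


if __name__ == "__main__":
    code_string, lengths = pack_dictionary(["0", "10", "110", "111"])
    print(code_string, lengths)
    print(unpack_dictionary(code_string, lengths))
```

Because the lengths are stored alongside the code string, the restoration side can recover each symbol's code without any separator bits.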

In the present invention, the step of generating the plurality of compressed cyclic groups may further comprise generating a coding dictionary that includes information on the frequency of occurrence of each symbol in each cyclic group.
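When the dictionary carries only occurrence frequencies, the restoration side has to rebuild the same Huffman codes from them. The sketch below shows one way this could work, assuming an ordinary heap-based Huffman construction with a fixed tie-breaking rule shared by both sides; the tie-breaking used in the patent may differ, and `build_codes`, `compress_group`, and `restore_group` are hypothetical names.

```python
import heapq
from collections import Counter


def build_codes(freqs: dict[int, int]) -> dict[int, str]:
    """Deterministic Huffman codes; ties are broken by the smallest symbol."""
    heap = [(f, s, [s]) for s, f in sorted(freqs.items())]
    heapq.heapify(heap)
    codes = {s: "" for s in freqs}
    while len(heap) > 1:
        f1, k1, g1 = heapq.heappop(heap)  # lowest frequency
        f2, k2, g2 = heapq.heappop(heap)  # second lowest
        for s in g1:
            codes[s] = "1" + codes[s]  # lower-frequency side gets "1"
        for s in g2:
            codes[s] = "0" + codes[s]
        heapq.heappush(heap, (f1 + f2, min(k1, k2), g1 + g2))
    return codes


def compress_group(symbols: list[int]) -> tuple[dict[int, int], str]:
    """Return (frequency dictionary, bit string) for one cyclic group."""
    freqs = dict(Counter(symbols))
    codes = build_codes(freqs)
    return freqs, "".join(codes[s] for s in symbols)


def restore_group(freqs: dict[int, int], bits: str) -> list[int]:
    """Rebuild the codes from the frequencies and decode the bit string."""
    inverse = {code: s for s, code in build_codes(freqs).items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # prefix property: the first match is the symbol
            out.append(inverse[buf])
            buf = ""
    return out


if __name__ == "__main__":
    group = [0, 0, 1, 0, 2, 1, 3, 0]
    freqs, bits = compress_group(group)
    print(freqs, bits, restore_group(freqs, bits) == group)
```

The essential design point is that the encoder and decoder apply exactly the same construction rule, so transmitting the frequencies is enough to reproduce the codes.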

According to another aspect of the present invention, there is provided a method by which a restoration apparatus restores binary data compressed by the binary data compression method, wherein a restoration unit refers to the coding dictionary for each cyclic group to restore the plurality of compressed cyclic groups into a plurality of cyclic groups, thereby restoring the binary data.

According to yet another aspect of the present invention, there is provided a binary data compression apparatus comprising a compression unit that scans original binary data in a first direction in units of n bits to obtain a plurality of cyclic blocks each having a plurality of n-bit symbols, obtains a plurality of cyclic groups, each including at least one cyclic block, based on the dispersion of the frequencies with which the symbols appear within one cyclic block or a plurality of neighboring cyclic blocks, performs Huffman coding on the symbols in each cyclic group to compress each cyclic group and generate a plurality of compressed cyclic groups, and combines the plurality of compressed cyclic groups to generate compressed data, wherein each cyclic block is a binary block containing the symbols encountered, when the original binary data is scanned sequentially in the first direction, until all 2^n different kinds of symbols have appeared at least once.

In the present invention, when obtaining the cyclic groups, the compression unit computes a first dispersion of the frequencies with which the symbols appear in the current cyclic block, compares the first dispersion with a reference dispersion, and, if the first dispersion is equal to or greater than the reference dispersion, obtains the current cyclic block as a cyclic group.

In the present invention, if the first dispersion is less than the reference dispersion, the compression unit computes a second dispersion of the frequencies with which the symbols appear in the current cyclic block and at least one following cyclic block and, if the second dispersion is equal to or greater than the reference dispersion, obtains the current cyclic block and the at least one following cyclic block as a cyclic group, wherein the number of following cyclic blocks is the minimum number that makes the second dispersion equal to or greater than the reference dispersion.

In the present invention, for the binary digits of a residual group left over after the plurality of cyclic groups have been obtained, the compression unit may additionally obtain a compressed residual group by performing Huffman coding based on the frequencies with which the symbols appear in the residual group.

In the present invention, when generating the plurality of compressed cyclic groups, the compression unit generates a Huffman code for each symbol according to the frequency of each symbol in each cyclic group and replaces each symbol in each cyclic group with the corresponding Huffman code.

In the present invention, the compression unit may further generate a coding dictionary that includes a Huffman code string, produced by concatenating the Huffman codes corresponding to the respective symbols in a single row, and the bit length of each Huffman code within the Huffman code string.

In the present invention, the compression unit may further generate a coding dictionary that includes information on the frequency of occurrence of each symbol in each cyclic group.

According to yet another aspect of the present invention, there is provided an apparatus for restoring binary data compressed by the binary data compression apparatus, the binary data restoration apparatus comprising a restoration unit that refers to the coding dictionary for each cyclic group to restore the plurality of compressed cyclic groups into a plurality of cyclic groups, thereby restoring the binary data.

The method and apparatus for compressing and restoring binary data according to the present invention can compress and restore binary data quickly and efficiently through simple operations and a simple hardware configuration, achieve an excellent compression ratio, improve the reliability of the compressed and restored data, and also improve transmission efficiency and speed when the data is transmitted. In particular, the present invention can raise the compression ratio of Huffman coding more effectively.

FIG. 1 shows the configuration of a binary data compression apparatus and restoration apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a binary data compression method according to an embodiment of the present invention.
FIG. 3 shows a binary tree structure used in Huffman coding.
FIGS. 4 and 5 are reference diagrams illustrating the assignment of Huffman codes to symbols.
FIG. 6 schematically shows a first cyclic block and a second cyclic block.
FIG. 7 schematically shows a first cyclic block and a second cyclic block according to the maximum cyclic block method.
FIG. 8 schematically shows a first cyclic block and a second cyclic block according to the minimum cyclic block method.
FIG. 9 conceptually shows the process by which the restoration side constructs a Huffman tree from sequentially stored symbol occurrence-frequency information and finds the Huffman code for each symbol.
FIG. 10 conceptually shows a method of separating the Huffman codes from the Huffman code string during data restoration to restore the per-symbol Huffman coding dictionary for each cyclic group.
FIG. 11 outlines the operating principle of the cyclic group acquisition unit.
FIG. 12 shows the process of restoring the original symbols from the actual compressed data using the per-group, per-symbol Huffman coding dictionary restored in FIG. 10.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 shows the configuration of a binary data compression apparatus and restoration apparatus according to an embodiment of the present invention; the embodiment is described below with reference to it.

As shown in FIG. 1, the binary data compression apparatus 100 according to this embodiment includes a compression unit 110 and an output unit 120. The compression unit 110 includes a cyclic group acquisition unit 111 and a compressed data generation unit 112.

First, the cyclic group acquisition unit 111 scans the original binary data in a first direction in units of n bits to obtain a plurality of cyclic blocks each having a plurality of n-bit symbols. The cyclic group acquisition unit 111 then obtains a plurality of cyclic groups, each including at least one cyclic block, based on the dispersion of the frequencies with which the symbols appear within one cyclic block or a plurality of neighboring cyclic blocks. Here, each cyclic block is a binary block containing the symbols encountered, when the original binary data is scanned sequentially in the first direction, until all 2^n different kinds of symbols have appeared at least once.

When obtaining a cyclic group, the cyclic group acquisition unit 111 computes a first dispersion of the frequencies with which the symbols appear in the current cyclic block, compares the first dispersion with a reference dispersion, and, if the first dispersion is equal to or greater than the reference dispersion, obtains the current cyclic block as a cyclic group. If the first dispersion is less than the reference dispersion, the cyclic group acquisition unit 111 computes a second dispersion of the frequencies with which the symbols appear in the current cyclic block and at least one following cyclic block and, if the second dispersion is equal to or greater than the reference dispersion, obtains the current cyclic block and the at least one following cyclic block as a cyclic group. The number of following cyclic blocks is the minimum number that makes the second dispersion equal to or greater than the reference dispersion.

As the dispersion, the variance, standard deviation, skewness, or kurtosis of the frequencies may be used; an index of the minimum number of symbol kinds that account for at least a reference ratio of the total frequency may also be used; and various other ways of computing a dispersion that can represent the distribution of the frequencies may be applied. These are described in detail later.

FIG. 11 outlines the operating principle of the cyclic group acquisition unit.

The compressed data generation unit 112 performs Huffman coding on the symbols in each cyclic group to compress each cyclic group and generate a plurality of compressed cyclic groups, and combines the plurality of compressed cyclic groups to generate compressed data.

When generating the plurality of compressed cyclic groups, the compressed data generation unit 112 generates a Huffman code for each symbol according to the frequency of each symbol in each cyclic group and replaces each symbol in each cyclic group with the corresponding Huffman code to generate the compressed cyclic group. The compressed data generation unit 112 may further generate a coding dictionary that includes a Huffman code string, produced by concatenating the Huffman codes corresponding to the respective symbols in a single row, and the bit length of each Huffman code within the Huffman code string.

For the binary digits of a residual group left over after the plurality of cyclic groups have been obtained, the compression unit 110 may perform Huffman coding based on the frequencies with which the symbols appear in the residual group to obtain a compressed residual group, which is then used when generating the final compressed data.

The output unit 120 outputs the compressed data generated by the compression unit 110 to a target device such as the binary data restoration apparatus 200.

As also shown in FIG. 1, the binary data restoration apparatus 200 according to this embodiment includes an input unit 210 and a restoration unit 220. The input unit 210 receives the compressed data delivered through the output unit 120 or the like and passes it to the restoration unit 220.

The restoration unit 220 includes a cyclic group restoration unit 221 and a combining unit 222.

The cyclic group restoration unit 221 refers to the coding dictionary for each cyclic group to restore the plurality of compressed cyclic groups into a plurality of cyclic groups, and the combining unit 222 combines these cyclic groups to restore the binary data.

The operation and effect of the embodiment configured as above are described in detail with reference to FIGS. 1 to 5.

FIG. 2 is a flowchart illustrating a binary data compression method according to an embodiment of the present invention, FIG. 3 shows a binary tree structure used in Huffman coding, and FIGS. 4 and 5 are reference diagrams illustrating the assignment of Huffman codes to symbols; the binary data compression method according to this embodiment is described with reference to them.

Before examining the binary data compression method according to this embodiment, Huffman coding is reviewed first.

Take, for example, a data file of 109,505 bytes. Cutting it into 4-bit units and checking the distribution gives Table 1 below. Each 4-bit binary number obtained this way lies in the range "0000" to "1111", that is, it takes one of 16 (= 2^4) values from decimal 0 to 15, and is called a symbol. Table 1 shows the number of occurrences of each symbol in the 109,505-byte data file.

Table 1. Occurrence frequency of each 4-bit symbol

4-bit symbol    Occurrence frequency
0               24434
1               13255
2               13528
3               12391
4               13765
5               12907
6               13236
7               13298
8               12635
9               12377
10              13259
11              12420
12              11828
13              12956
14              12673
15              14048
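A histogram like Table 1 can be produced by counting the two 4-bit symbols in every byte. The sketch below assumes the high nibble of each byte is read first; `nibble_histogram` is a hypothetical name. It also checks that the Table 1 counts add up to two symbols per byte of the 109,505-byte file.

```python
from collections import Counter


def nibble_histogram(data: bytes) -> list[int]:
    """Count occurrences of each 4-bit symbol (0..15) in data."""
    counts = Counter()
    for byte in data:
        counts[byte >> 4] += 1    # high nibble (assumed to be scanned first)
        counts[byte & 0x0F] += 1  # low nibble
    return [counts[s] for s in range(16)]


# Occurrence frequencies copied from Table 1, indexed by symbol value.
TABLE_1 = [24434, 13255, 13528, 12391, 13765, 12907, 13236, 13298,
           12635, 12377, 13259, 12420, 11828, 12956, 12673, 14048]

if __name__ == "__main__":
    print(nibble_histogram(b"\x0f\xf0\x00"))
    print(sum(TABLE_1) == 2 * 109_505)  # two 4-bit symbols per byte
```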

Huffman coding is a kind of entropy coding used for lossless compression. It is an algorithm that builds a prefix code (a code in which no symbol's code is a prefix of another symbol's code) from symbol frequencies, assigning longer codes to symbols that occur less often (low-frequency symbols) and shorter codes to symbols that occur more often (high-frequency symbols). It generates variable-length codes for fixed-length symbols.

There are many ways to design a Huffman code; one example generates the code using a binary tree. FIG. 3 shows a binary tree structure used in Huffman coding, in which each outer node (external node), or leaf node, represents its corresponding symbol. The Huffman code for any symbol is generated by starting at the root node and descending to the leaf node corresponding to the symbol, concatenating the code bits assigned to each branch along the way.

In FIG. 3, the Huffman code for symbol A is "0", obtained by applying in order the bits assigned to the arms along the path descending from the root node through the inner nodes; for symbol B it is "10", for symbol C "110", and for symbol D "111".
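The four codes read from FIG. 3 can be used to show why the prefix property allows decoding without separators. This is only an illustrative sketch; the `encode` and `decode` helpers are hypothetical, and bits are represented as text.

```python
# Huffman codes for the symbols of FIG. 3, as given in the text.
CODES = {"A": "0", "B": "10", "C": "110", "D": "111"}


def encode(symbols: str) -> str:
    """Replace each symbol with its Huffman code."""
    return "".join(CODES[s] for s in symbols)


def decode(bits: str) -> str:
    """Read bits left to right; emit a symbol as soon as a code matches."""
    inverse = {code: s for s, code in CODES.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # no code is a prefix of another, so this is safe
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("bit string ends in the middle of a code")
    return "".join(out)


if __name__ == "__main__":
    bits = encode("ABCDA")
    print(bits, decode(bits))
```

Encoding "ABCDA" gives the 10-bit string "0101101110", and reading it from the left recovers the symbols one at a time.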

도 3을 참조하여 상기 허프만 트리를 구성하는 방법 중 최소변이 허프만 코드를 하나의 예시로 좀 더 구체적으로 설명하면, 가장 낮은 발생확률 또는 빈도를 갖는 두 심볼을 결합한 후 새로 생성된 두 심볼 간의 발생확률의 합 또는 빈도수의 합을 남아 있는 심볼들의 확률과 다시 비교하여 다시 정렬한 후 코드 생성과정을 반복한다. 다섯 개의 심볼로 구성된 집합 S={A, B, C, D, E}에 대한 최소변이 허프만 코드를 설계해 보면, 각각의 발생 빈도는 A, B, C, D, E 순서대로 2회, 4회, 2회, 1회, 1회라고 하자. 물론 횟수 이외에 발생확률을 이용하여도 계산하여 동일한 계산과정을 따르면 된다.Referring to FIG. 3, a method of constructing the Huffman tree will be described in more detail with reference to an example of a minimum-likelihood Huffman code. The probability of occurrence between two newly generated symbols after combining two symbols having the lowest probability or frequency Is summed again with the probability of the remaining symbols, and then the code generation process is repeated. If we design a minimal variance Huffman code for a set of five symbols S = {A, B, C, D, E} Let's say once, twice, once, and once. Of course, it is also possible to calculate the probability of occurrence in addition to the number of occurrences, and to follow the same calculation procedure.

최소변이 허프만 코드의 시작은 가장 낮은 발생확률 또는 출현 빈도를 갖는 두 심볼 D와 E를 결합하는 것으로부터 시작한다. 그리고, 이 두 심볼에 대응하는 코드워드의 가장 마지막 비트에 서로 다른 비트를 할당한다. 이 때 서로 다른 비트를 할당하는 규칙은, 발생확률 또는 출현빈도가 높은 쪽에 "0"을 적은쪽에 "1"을 할당하고, 물론 그 반대도 가능하다. 그런데 일단 정해지면 그 규칙은 전체 트리생성과정에 일관적으로 적용되어야 복호화과정이 간명해진다.The start of the minimal transition Huffman code begins by combining the two symbols D and E with the lowest probability of occurrence or frequency of occurrence. Different bits are assigned to the last bit of the code word corresponding to these two symbols. In this case, the rule for assigning different bits is such that "1" is assigned to the side where the probability of occurrence or frequency of occurrence is "0", and vice versa. Once determined, however, the rules must be applied consistently throughout the entire tree generation process to simplify the decoding process.

If the probabilities or frequencies are equal, the bits can be assigned by a fixed criterion based on the sort order of the symbols. In this embodiment, "0" is assigned to the symbol positioned higher when the symbols are sorted in ascending order. If the symbols are instead sorted in descending order, the upper and lower symbols swap positions and the assignment of "0" and "1" may change, but the structure of the Huffman tree is essentially preserved, so encoding, decoding, and performance are not affected.

That is, FIG. 4 assigns "0" to D and "1" to E, but assigning "1" to D and "0" to E would work equally well. Once the rule is fixed as descending or ascending order, however, it must be applied consistently so that decoding remains simple.

Next, the two symbols D and E are combined into a new symbol n1, whose frequency equals the sum of the frequencies of D and E. The frequency of n1 is therefore 2. The four resulting symbols A, B, n1, and C are then re-sorted by frequency. If an existing symbol has the same probability or frequency as the newly created symbol, the new symbol is placed above that existing symbol. This is what distinguishes the minimum-variance Huffman code from the ordinary Huffman code.

The process described above is summarized in FIG. 5. The sequence of bits along the path down to each leaf node is the Huffman code of the corresponding symbol. Table 2 shows the Huffman codes generated by applying this example to each symbol of an actual binary cluster (binary data).

| Symbol | Appearance frequency | Huffman code |
|---|---|---|
| A | 2 | 10 |
| B | 4 | 00 |
| C | 2 | 11 |
| D | 1 | 010 |
| E | 1 | 011 |
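The merge-and-label rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `min_variance_huffman` is hypothetical, symbols with equal frequency are assumed to keep their input order, and "0" is given to the upper of the two merged entries as in the embodiment.

```python
def min_variance_huffman(freqs):
    """Build minimum-variance Huffman codes.

    freqs: dict mapping symbol -> frequency (insertion order is the tie order).
    Returns a dict mapping symbol -> code string.
    """
    # Each node is (weight, [symbols below it]); sorted() is stable, so
    # equal-weight symbols keep their input order (an assumption of this sketch).
    nodes = sorted(((w, [s]) for s, w in freqs.items()), key=lambda n: -n[0])
    codes = {s: "" for s in freqs}
    while len(nodes) > 1:
        lower = nodes.pop()   # bottom entry of the sorted list
        upper = nodes.pop()   # entry just above it
        for s in upper[1]:
            codes[s] = "0" + codes[s]   # upper entry gets "0"
        for s in lower[1]:
            codes[s] = "1" + codes[s]   # lower entry gets "1"
        merged = (upper[0] + lower[0], upper[1] + lower[1])
        # Minimum-variance rule: place the merged node ABOVE existing
        # nodes that have the same weight.
        i = 0
        while i < len(nodes) and nodes[i][0] > merged[0]:
            i += 1
        nodes.insert(i, merged)
    return codes


example = min_variance_huffman({"A": 2, "B": 4, "C": 2, "D": 1, "E": 1})
```

With the frequencies of the five-symbol example, this sketch yields the codes listed in Table 2 (A=10, B=00, C=11, D=010, E=011).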

This embodiment mainly uses a minimum-variance tree as the Huffman coding method, but encoding and decoding can of course be performed with any of a wide variety of known Huffman tree generation methods, such as the ordinary Huffman tree or adaptive Huffman coding.

Table 3 below shows the size after compression when data is encoded with this minimum-variance Huffman tree implementation.

Table 3 shows the result of Huffman coding each symbol of the 109,505-byte data file described in Table 1.

| 4-bit symbol | Appearance frequency | Huffman code | Compressed size (bits) | Original size (bits) |
|---|---|---|---|---|
| 0 | 24434 | 000 | 73302 | 97736 |
| 1 | 13255 | 1001 | 53020 | 53020 |
| 2 | 13528 | 1100 | 54112 | 54112 |
| 3 | 12391 | 0010 | 49564 | 49564 |
| 4 | 13765 | 1101 | 55060 | 55060 |
| 5 | 12907 | 0110 | 51628 | 51628 |
| 6 | 13236 | 1000 | 52944 | 52944 |
| 7 | 13298 | 1011 | 53192 | 53192 |
| 8 | 12635 | 0100 | 50540 | 50540 |
| 9 | 12377 | 11111 | 61885 | 49508 |
| 10 | 13259 | 1010 | 53036 | 53036 |
| 11 | 12420 | 0011 | 49680 | 49680 |
| 12 | 11828 | 11110 | 59140 | 47312 |
| 13 | 12956 | 0111 | 51824 | 51824 |
| 14 | 12673 | 0101 | 50692 | 50692 |
| 15 | 14048 | 1110 | 56192 | 56192 |
| Total | | | 875811 | 876040 |

As Table 3 shows, in some cases the compression ratio can be quite low: here, excluding the Huffman coding dictionary, the compressed data is only 229 bits (= 876040 - 875811) smaller than the original. The embodiment of the present invention described below improves on this and effectively raises the compression ratio even when Huffman coding based on the overall per-symbol frequencies gives poor results, as explained in detail next.
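The totals in Table 3 can be checked with a short Python sketch. The frequencies and codes are transcribed from the table; the function name `sizes_in_bits` is hypothetical and introduced only for illustration.

```python
# Frequencies and Huffman codes transcribed from Table 3 (symbols 0..15).
FREQS = [24434, 13255, 13528, 12391, 13765, 12907, 13236, 13298,
         12635, 12377, 13259, 12420, 11828, 12956, 12673, 14048]
CODES = ["000", "1001", "1100", "0010", "1101", "0110", "1000", "1011",
         "0100", "11111", "1010", "0011", "11110", "0111", "0101", "1110"]


def sizes_in_bits(freqs, codes, symbol_bits=4):
    """Return (compressed bits, original bits), ignoring the dictionary."""
    compressed = sum(f * len(c) for f, c in zip(freqs, codes))
    original = sum(freqs) * symbol_bits
    return compressed, original


compressed_bits, original_bits = sizes_in_bits(FREQS, CODES)
saved_bits = original_bits - compressed_bits
```

Because only symbol 0 receives a 3-bit code while symbols 9 and 12 receive 5-bit codes, the gain on symbol 0 is almost entirely cancelled, which is why the saving is so small.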

First, as shown in FIG. 2, the cyclic group obtaining unit 111 of the compression unit 100 scans the original binary data in a first direction in units of n bits to obtain a plurality of cyclic blocks, each having a plurality of n-bit symbols (S201). That is, the cyclic group obtaining unit 111 scans the original binary data n bits at a time, from the most significant bit toward the lower bits or in the opposite direction. Here n is an arbitrary natural number. If n = 4, for example, scanning yields 4-bit symbols of 16 (= 2⁴) kinds, from "0000" to "1111". Likewise, if n = 6, scanning yields 6-bit symbols of 64 (= 2⁶) kinds, from "000000" to "111111".
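As a rough illustration of the n-bit scan, the following Python sketch reads a byte string from the most significant bit downward. The name `scan_symbols` is hypothetical, and dropping trailing bits that do not fill a whole symbol is an assumption of this sketch; the text does not say how such bits are handled.

```python
def scan_symbols(data: bytes, n: int):
    """Split data into n-bit symbols, most significant bit first.

    Assumption: trailing bits that do not fill a whole symbol are dropped.
    """
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")   # treat the input as one big integer
    mask = (1 << n) - 1
    symbols = []
    for i in range(total_bits // n):
        shift = total_bits - (i + 1) * n  # position of the i-th symbol
        symbols.append((value >> shift) & mask)
    return symbols
```

For example, the single byte 0xA5 (binary 10100101) scanned with n = 4 gives the symbols 10 and 5.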

A cyclic block is a binary block made up of the symbols encountered, while the original binary data is scanned sequentially in the first direction, until all 2ⁿ different kinds of symbols have appeared at least once. That is, when scanning from the most significant bit toward the lower bits with n = 4, a cyclic block is the binary block containing every symbol scanned until all 16 (= 2⁴) kinds of symbols from "0000" to "1111" have appeared. The frequency of each symbol within one cyclic block can vary widely: for example, 0000 may appear once, 0001 five times, ..., and 1111 ten times. This process of obtaining cyclic blocks is repeated until the end of the original binary data.
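The definition of a cyclic block can be sketched in Python as below. The function name `split_cyclic_blocks` and the returned "leftover" list are illustrative choices of this sketch, not names from the patent; symbols are assumed to be integers in the range 0 to 2ⁿ - 1.

```python
def split_cyclic_blocks(symbols, n):
    """Cut a symbol sequence into cyclic blocks.

    A block is closed as soon as all 2**n kinds of symbols have been
    seen at least once. Returns (blocks, leftover), where leftover holds
    trailing symbols that never completed a block.
    """
    kinds = 1 << n
    blocks, current, seen = [], [], set()
    for s in symbols:
        current.append(s)
        seen.add(s)
        if len(seen) == kinds:       # every kind has appeared: close the block
            blocks.append(current)
            current, seen = [], set()
    return blocks, current


# Tiny example with n = 1 (only the symbols 0 and 1 exist).
blocks, leftover = split_cyclic_blocks([0, 0, 1, 1, 0, 1, 1], n=1)
```

In this toy input the first block is `[0, 0, 1]`, the second is `[1, 0]`, and the final `[1, 1]` is left over because the symbol 0 does not appear again.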

Next, the cyclic group obtaining unit 111 obtains a plurality of cyclic groups, each including at least one cyclic block, based on the dispersion of the frequencies with which the symbols appear in one cyclic block or in a plurality of neighboring cyclic blocks (S202). The dispersion may be the variance, standard deviation, skewness, or kurtosis of the frequencies, or an index of the minimum number of symbol kinds that account for at least a reference proportion of the total frequency; various other measures that describe the distribution of the frequencies may also be applied. These indices have in common that they indicate how unevenly the frequency values are distributed.

The "index of the minimum number of symbol kinds that account for at least a reference proportion of the total frequency" is obtained by summing the frequencies in order from the largest and counting how many symbol kinds must be summed before a specified proportion of the total frequency is reached. For example, if the reference proportion is 50% of the total frequency, the index is the smallest number of symbol kinds that together account for 50% or more of the total. The smaller this number, the larger the share taken by a few particular kinds of symbols, so the index can serve as a kind of dispersion measure.

Two settings are possible in this regard: the proportion A% of the total frequency and the minimum number of kinds B are specified, and Huffman coding is performed when the symbols of a symbol group satisfy the criteria given by A and B.

For example, as in Table 4, in a cyclic group in which 256 kinds of symbols from 0 to 255 appear a total of 33048 times (some rows are omitted for space), the cumulative frequency corresponding to 50% of the total is 15024.

| Symbol | Appearance frequency |
|---|---|
| 0 | 272 |
| 1 | 131 |
| 2 | 85 |
| 3 | 164 |
| 4 | 137 |
| 5 | 100 |
| 6 | 101 |
| 7 | 138 |
| 8 | 112 |
| 9 | 95 |
| 10 | 107 |
| ... | ... |
| 240 | 82 |
| 241 | 81 |
| 242 | 134 |
| 243 | 97 |
| 244 | 90 |
| 245 | 109 |
| 246 | 140 |
| 247 | 84 |
| 248 | 68 |
| 249 | 112 |
| 250 | 93 |
| 251 | 95 |
| 252 | 101 |
| 253 | 85 |
| 254 | 95 |
| 255 | 99 |

When Table 4 is sorted in descending order of frequency and the frequencies are summed in order, the cumulative sum first exceeds 15024 at the 80th kind, where it reaches 15131, as shown in Table 5 below. At least 80 kinds are therefore needed to account for 50% or more of the total. The smaller this value of 80 is, the more densely the symbols are concentrated in particular kinds.

| Symbol | Appearance frequency |
|---|---|
| 138 | 979 |
| 40 | 960 |
| 162 | 959 |
| 69 | 431 |
| 81 | 420 |
| 20 | 397 |
| 0 | 272 |
| 36 | 211 |
| 146 | 196 |
| 92 | 195 |
| 73 | 177 |
| 183 | 170 |
| 56 | 167 |
| 70 | 167 |
| 113 | 166 |
| 3 | 164 |
| 72 | 164 |
| 91 | 163 |
| 200 | 162 |
| 219 | 162 |
| 140 | 160 |
| 149 | 160 |
| 28 | 159 |
| 227 | 157 |
| 105 | 156 |
| 218 | 155 |
| 130 | 154 |
| 142 | 153 |
| 145 | 153 |
| 25 | 152 |
| 32 | 151 |
| 199 | 151 |
| 50 | 150 |
| 198 | 150 |
| 90 | 149 |
| 181 | 146 |
| 228 | 146 |
| 35 | 145 |
| 57 | 144 |
| 107 | 143 |
| 115 | 143 |
| 24 | 142 |
| 237 | 142 |
| 114 | 141 |
| 214 | 141 |
| 99 | 140 |
| 158 | 140 |
| 231 | 140 |
| 246 | 140 |
| 100 | 139 |
| 7 | 138 |
| 84 | 138 |
| 4 | 137 |
| 103 | 137 |
| 110 | 137 |
| 121 | 137 |
| 201 | 137 |
| 204 | 137 |
| 207 | 137 |
| 48 | 135 |
| 79 | 135 |
| 112 | 135 |
| 182 | 135 |
| 21 | 134 |
| 206 | 134 |
| 242 | 134 |
| 109 | 133 |
| 134 | 133 |
| 150 | 132 |
| 192 | 132 |
| 1 | 131 |
| 59 | 131 |
| 147 | 131 |
| 222 | 131 |
| 12 | 130 |
| 14 | 130 |
| 30 | 130 |
| 44 | 129 |
| 210 | 129 |
| 75 | 128 |
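The "minimum number of kinds" index can be reproduced from the Table 5 frequencies with a short Python sketch. The function name `kinds_needed` is hypothetical; the target of 15024 is taken directly from the text rather than recomputed from a total.

```python
# The 80 frequencies listed in Table 5, in the order shown there.
TABLE5 = [979, 960, 959, 431, 420, 397, 272, 211, 196, 195,
          177, 170, 167, 167, 166, 164, 164, 163, 162, 162,
          160, 160, 159, 157, 156, 155, 154, 153, 153, 152,
          151, 151, 150, 150, 149, 146, 146, 145, 144, 143,
          143, 142, 142, 141, 141, 140, 140, 140, 140, 139,
          138, 138, 137, 137, 137, 137, 137, 137, 137, 135,
          135, 135, 135, 134, 134, 134, 133, 133, 132, 132,
          131, 131, 131, 131, 130, 130, 130, 129, 129, 128]


def kinds_needed(freqs, target):
    """Smallest number of symbol kinds whose frequencies, summed from the
    largest downward, reach the target. Returns (count, cumulative sum),
    or None if the target is never reached."""
    running = 0
    for count, f in enumerate(sorted(freqs, reverse=True), start=1):
        running += f
        if running >= target:
            return count, running
    return None


result = kinds_needed(TABLE5, 15024)
```

Summing only the first 79 entries gives 15003, just short of the target, so the 80th kind is the one that crosses it.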

Looking at step S202 in more detail, the cyclic group obtaining unit 111 calculates a first dispersion of the frequencies with which the symbols appear in the current cyclic block and compares it with a reference dispersion. A dispersion such as the variance or standard deviation indicates how widely the frequencies are spread; a large value means the frequencies differ greatly, with relatively large and small frequencies mixed together. For example, the frequency of a particular symbol A may be very large while that of a particular symbol B is relatively small. When the dispersion is large, the frequencies are spread far from the mean, so particular symbols appear disproportionately often within the cyclic block. Compressing symbols whose frequency is much higher than that of the others can therefore raise the data compression ratio.

If the comparison shows that the first dispersion is equal to or greater than the reference dispersion, the cyclic group obtaining unit 111 obtains the current cyclic block as a cyclic group. A cyclic group is the symbol group that serves as the unit for the Huffman coding described below. When the first dispersion, that is, the dispersion of the current cyclic block, is at or above the reference dispersion, the symbol frequencies in this block are widely spread, which means that particular symbols appear in concentrated numbers. Huffman coding or similar compression is therefore relatively efficient for such a block, and the block is obtained as a cyclic group.

If, on the other hand, the first dispersion is below the reference dispersion, the cyclic group obtaining unit 111 does not form a cyclic group from the current cyclic block alone but also takes the following cyclic block into account. That is, it calculates a second dispersion of the frequencies with which the symbols appear in the current cyclic block and the next cyclic block together, and compares it with the reference dispersion. If the second dispersion is equal to or greater than the reference dispersion, the cyclic group obtaining unit 111 obtains the current cyclic block and the next cyclic block together as a cyclic group. In that case particular symbols can be regarded as appearing in concentrated numbers across the two blocks, so Huffman coding or similar compression is relatively efficient and the two cyclic blocks are obtained as one cyclic group.

If the second dispersion is also below the reference dispersion, the cyclic group obtaining unit 111 takes the cyclic block after that into account as well, calculates the dispersion, and compares it with the reference dispersion, repeating this process until a cyclic group is obtained. A cyclic group may therefore consist of a single cyclic block, or of two, three, five, or any other number of cyclic blocks depending on the characteristics of the data. The number of cyclic blocks forming a cyclic group is, however, the minimum number for which their dispersion reaches the reference dispersion: once a certain number of cyclic blocks satisfies the condition, the cyclic group is obtained without considering the next cyclic block.
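The block-accumulation procedure above can be sketched in Python as follows. This is an illustration under assumptions: the name `form_groups` is hypothetical, each block is represented only by its list of per-symbol frequencies, the sample variance from the standard library stands in for the dispersion, and the toy frequencies and threshold are invented for the example.

```python
from statistics import variance


def form_groups(block_freqs, threshold, dispersion=variance):
    """Greedily merge consecutive cyclic blocks into cyclic groups.

    block_freqs: list of per-block frequency lists (same symbol order).
    A group is closed as soon as the dispersion of its summed
    frequencies reaches the threshold. Returns (groups, pending), where
    pending holds trailing blocks that never reached the threshold.
    """
    groups, pending, acc = [], [], None
    for freqs in block_freqs:
        pending.append(freqs)
        acc = list(freqs) if acc is None else [a + b for a, b in zip(acc, freqs)]
        if dispersion(acc) >= threshold:
            groups.append(pending)        # minimum number of blocks reached
            pending, acc = [], None
    return groups, pending


# Hypothetical frequencies for n = 2 (four symbol kinds) and threshold 10.
toy_blocks = [[2, 1, 1, 1], [9, 1, 1, 1], [8, 1, 1, 1], [1, 1, 2, 1]]
groups, pending = form_groups(toy_blocks, threshold=10)
```

Here the first block alone is too uniform, so it is merged with the second; the third block qualifies on its own; the fourth remains pending.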

Next, the compressed data generation unit 112 performs Huffman coding on the symbols in each cyclic group, compressing each cyclic group to generate a plurality of compressed cyclic groups (S203). Specifically, the compressed data generation unit 112 generates a Huffman code for each symbol according to the frequency of that symbol within each cyclic group, and replaces each symbol in the group with the corresponding Huffman code. As described above, Huffman coding assigns relatively short codes to frequent symbols and relatively long codes to infrequent symbols. By assigning and substituting the corresponding Huffman code for each symbol in each cyclic group in this way, each cyclic group can be compressed effectively.
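The substitution step can be pictured with a very small Python sketch. The code table reuses the five-symbol example of Table 2; the function name `encode_group` and the sample symbol sequence are hypothetical.

```python
# Code table from the five-symbol example (Table 2).
CODE_TABLE = {"A": "10", "B": "00", "C": "11", "D": "010", "E": "011"}


def encode_group(symbols, code_table):
    """Replace every symbol of a group by its Huffman code (as a bit string)."""
    return "".join(code_table[s] for s in symbols)


# A hypothetical sequence with the frequencies A=2, B=4, C=2, D=1, E=1.
sequence = list("BABCBDBCAE")
bits = encode_group(sequence, CODE_TABLE)
```

The ten symbols take 22 bits after substitution, compared with 30 bits if each of the five kinds were stored with a fixed 3-bit code.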

When generating a compressed cyclic group, the compressed data generation unit 112 also generates an encoding dictionary that contains a Huffman code string, formed by concatenating the Huffman codes corresponding to the symbols, and the bit length (DIC_SIZE) of each Huffman code in that string. Other known methods of generating an encoding dictionary may of course be used. Performing Huffman coding on each cyclic group of the original binary data produces the compressed cyclic groups, but restoring the original binary data later requires knowing which Huffman code was assigned to which symbol. The Huffman codes corresponding to the symbols are therefore concatenated into a Huffman code string, and this string together with the bit length of each Huffman code it contains forms the encoding dictionary used for decoding.

For example, when compressing with symbols of bit length n = 4, suppose that within some cyclic group the symbols 0000, 0001, 0010, 0011, ..., 1111 correspond to the Huffman codes 000, 11111, 1100, 0010, ..., 11110, respectively. An encoding dictionary is then generated that contains the Huffman code string 000/11111/1100/0010/.../11110, formed by concatenating these codes, together with the bit length of each code in the string (3, 5, 4, 4, ..., 5). Such an encoding dictionary is generated for every cyclic group.
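One way to picture the encoding dictionary is the Python sketch below, which concatenates the codes in symbol order and records their bit lengths, then splits the string back apart. The function names are hypothetical, and the five codes of Table 2 are used because the n = 4 example above is only partially listed.

```python
def build_dictionary(codes_in_symbol_order):
    """Return (code string, bit lengths) for codes listed in symbol order."""
    code_string = "".join(codes_in_symbol_order)
    lengths = [len(c) for c in codes_in_symbol_order]   # DIC_SIZE per symbol
    return code_string, lengths


def parse_dictionary(code_string, lengths):
    """Recover the per-symbol codes from the code string and bit lengths."""
    codes, pos = [], 0
    for n in lengths:
        codes.append(code_string[pos:pos + n])
        pos += n
    return codes


table2_codes = ["10", "00", "11", "010", "011"]          # symbols A..E
code_string, lengths = build_dictionary(table2_codes)
```

Because the symbol order is fixed, the symbols themselves need not be stored: position in the list identifies the symbol.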

If binary data remains after the cyclic groups have been obtained, that is, if a residual portion (residual group) is left over after the cyclic group obtaining step (S202) without having become a cyclic group, the compressed data generation unit 112 may perform Huffman coding based on the frequencies with which the symbols appear in the residual group to obtain a compressed residual group. Depending on the embodiment, the binary data of the residual group may instead be transmitted as is, without compression.

Next, the compressed data generation unit 112 combines the compressed cyclic groups generated above into compressed data and outputs it through the output unit 120 to a destination apparatus such as the restoration apparatus 200 (S204). If there is a compressed residual group as described above, it is included in the compressed data as well. The compression apparatus 100 may output the encoding dictionary for each cyclic group together with the compressed data, or may transmit it to the destination apparatus, such as the restoration apparatus 200, over a channel separate from the compressed data.

When the binary data has been compressed and output through the above process, the binary data restoration apparatus 200 receives the compressed data through the input unit 210 and passes it to the restoration unit 220. As shown in FIG. 1, the restoration unit 220 includes a cyclic group restoration unit 221 and a combining unit 222. The cyclic group restoration unit 221 restores the plurality of compressed cyclic groups into a plurality of cyclic groups by referring to the encoding dictionary for each cyclic group, and the combining unit 222 combines these cyclic groups to restore the binary data.
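Restoring one compressed cyclic group from its code table can be sketched in Python as below. The function name `decode_group` is hypothetical, the bit stream is modeled as a string of "0" and "1" characters for readability, and the code table is the Table 2 example; the sketch relies on Huffman codes being prefix-free.

```python
def decode_group(bits, code_table):
    """Turn a bit string back into symbols using a prefix-free code table."""
    inverse = {code: sym for sym, code in code_table.items()}
    symbols, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:          # a complete codeword has been read
            symbols.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("bit string ends in the middle of a codeword")
    return symbols


CODES_TABLE2 = {"A": "10", "B": "00", "C": "11", "D": "010", "E": "011"}
restored = decode_group("0010010011", CODES_TABLE2)
```

The restored groups would then simply be concatenated in order, which corresponds to the role of the combining unit 222.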

In this embodiment, the original binary data is described as being scanned to obtain a plurality of cyclic blocks, after which cyclic groups are obtained and each cyclic group is compressed by Huffman coding. The present invention is not limited to this, however. It also covers the case in which the original binary data is scanned sequentially in a particular direction, cyclic groups are obtained as cyclic blocks are obtained, and each cyclic group is Huffman coded as soon as it is obtained. That is, even before the entire original binary data has been scanned or before cyclic blocks have been obtained for all of it, one or more sequentially obtained cyclic blocks may be taken as a cyclic group as soon as they satisfy the cyclic group formation condition, and that cyclic group may be Huffman coded immediately.

The binary data compression and restoration methods according to this embodiment are described below through a more concrete example.

Arbitrary original binary data was selected and compressed according to this embodiment. First, 4-bit symbols are read sequentially from the most significant bit, and the span up to the point where every possible symbol has been encountered at least once for the first time is called the first cyclic block. In this example, all symbols from 0 to 15 have been encountered at least once when the 101st data item, 10 (binary "1010"), is read. Table 6 shows the tally after reading up to the 100th data item, and Table 7 shows the tally after reading up to the 101st.

| Symbol | Appearance frequency |
|---|---|
| 0 | 33 |
| 1 | 5 |
| 2 | 4 |
| 3 | 6 |
| 4 | 7 |
| 5 | 8 |
| 6 | 9 |
| 7 | 7 |
| 8 | 5 |
| 9 | 1 |
| 11 | 2 |
| 12 | 3 |
| 13 | 4 |
| 14 | 4 |
| 15 | 2 |
| Total | 100 |

| Symbol | Appearance frequency |
|---|---|
| 0 | 33 |
| 1 | 5 |
| 2 | 4 |
| 3 | 6 |
| 4 | 7 |
| 5 | 8 |
| 6 | 9 |
| 7 | 7 |
| 8 | 5 |
| 9 | 1 |
| 10 | 1 |
| 11 | 2 |
| 12 | 3 |
| 13 | 4 |
| 14 | 4 |
| 15 | 2 |
| Total | 101 |

As Tables 6 and 7 show, starting from the 1st data item, it is on encountering the 101st data item, 10 (binary "1010"), that every possible symbol from 0 to 15 has been met at least once for the first time. This yields the first cyclic block.

The second cycle keeps its frequency distribution separate from that of the first cycle, which ends at the 101st item. Reading resumes from the 102nd item and continues until every possible symbol from 0 to 15 has again been encountered at least once. In this example, the 102nd through 1182nd data items form the second cyclic block.

Table 8 shows the distribution after reading the 102nd through 1181st data items, and Table 9 shows the distribution after reading the 102nd through 1182nd data items.

| Symbol | Appearance frequency |
|---|---|
| 0 | 1035 |
| 1 | 3 |
| 2 | 5 |
| 3 | 4 |
| 4 | 3 |
| 5 | 1 |
| 6 | 5 |
| 7 | 4 |
| 8 | 2 |
| 9 | 2 |
| 10 | 4 |
| 11 | 3 |
| 12 | 4 |
| 13 | 3 |
| 15 | 2 |
| Total | 1080 |

| Symbol | Appearance frequency |
|---|---|
| 0 | 1035 |
| 1 | 3 |
| 2 | 5 |
| 3 | 4 |
| 4 | 3 |
| 5 | 1 |
| 6 | 5 |
| 7 | 4 |
| 8 | 2 |
| 9 | 2 |
| 10 | 4 |
| 11 | 3 |
| 12 | 4 |
| 13 | 3 |
| 14 | 1 |
| 15 | 2 |
| Total | 1081 |

As Tables 8 and 9 show, encountering the symbol 14 at the 1182nd position means that every symbol from 0 to 15 has again been encountered at least once.

FIG. 6 schematically shows the first cyclic block and the second cyclic block.

As a further variation on how cyclic blocks are delimited, a cyclic block may end at the symbol position where every possible symbol has first been encountered at least once (for example, the 101st and 1182nd symbols, ... in the figure above), or it may also include any immediately following consecutive occurrences of that same symbol. In this embodiment the former is called the minimum cyclic block configuration method and the latter the maximum cyclic block configuration method. Either can be used, and one of them can be selected by configuration.

FIG. 7 shows the maximum cyclic block method. Every possible symbol has already been encountered at least once by the 101st symbol, but if, for example, the 102nd through 199th symbols are all the same as the 101st, the first cyclic block extends through the 199th symbol; the frequencies of all symbols are then reset and the second cyclic block starts from the 200th symbol.

In the minimum cyclic block method, by contrast, as shown in FIG. 8, the first cyclic block is closed at the 101st symbol, where every symbol possible under fixed-length bit extraction has been encountered at least once; the frequencies of all symbols are then reset and the second cyclic block starts from the 102nd symbol.

This embodiment uses the minimum block configuration method as its example, but the same algorithm can also be applied with cyclic blocks formed by the maximum block configuration method.
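The difference between the two configuration methods can be sketched in Python with a single flag. The function name `split_blocks` and the `maximal` parameter are hypothetical; with `maximal=True`, a completed block also absorbs any immediately following repeats of the symbol that completed it.

```python
def split_blocks(symbols, n, maximal=False):
    """Cut symbols into cyclic blocks using the minimum method
    (maximal=False) or the maximum method (maximal=True).
    Returns (blocks, leftover)."""
    kinds = 1 << n
    blocks, current, seen = [], [], set()
    i = 0
    while i < len(symbols):
        s = symbols[i]
        current.append(s)
        seen.add(s)
        i += 1
        if len(seen) == kinds:
            if maximal:
                # Absorb consecutive repeats of the completing symbol.
                while i < len(symbols) and symbols[i] == s:
                    current.append(symbols[i])
                    i += 1
            blocks.append(current)
            current, seen = [], set()   # reset frequencies for the next block
    return blocks, current


data = [0, 1, 1, 1, 0, 1]               # toy input with n = 1
minimum = split_blocks(data, 1)
maximum = split_blocks(data, 1, maximal=True)
```

On this toy input the minimum method closes the first block at the second symbol, while the maximum method extends it through the run of repeated 1s.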

Once the first cyclic block has been identified as above, its table of symbols and appearance frequencies, such as Table 7, can be used to estimate in advance whether Huffman coding of the first cyclic block would compress well. Various indices could be used to predict the compression effect. As one example, this embodiment uses the variance of the per-symbol appearance frequencies, but any statistical or mathematical index that characterizes the distribution of the per-symbol frequencies may be used, although the resulting compression efficiency may differ somewhat. What matters is being able to determine whether the compression effect of Huffman coding can be maximized. From Table 7, the variance of the per-symbol appearance frequencies in the first cyclic block is about 56.495. The variance is computed by the well-known formula, so the calculation is not detailed here.

This variance is compared with a preset reference value T. If the variance is equal to or greater than T, the frequencies are sufficiently spread out, so Huffman coding can be expected to compress well. In this example, T = 10000.

Therefore, the data distribution up to the first cyclic block in Table 7 alone would not yield a sufficient compression gain from Huffman coding. In this case, the variance must be computed for the appearance frequencies shown in Table 10, which sums the distributions of the first cyclic block and the following second cyclic block. In other words, Table 10 shows the distribution for the group formed by combining the first cyclic block of Table 7 and the second cyclic block of Table 9.

Table 10

| Symbol | Appearance frequency |
|---|---|
| 0 | 1068 |
| 1 | 8 |
| 2 | 9 |
| 3 | 10 |
| 4 | 10 |
| 5 | 9 |
| 6 | 14 |
| 7 | 11 |
| 8 | 7 |
| 9 | 3 |
| 10 | 5 |
| 11 | 5 |
| 12 | 7 |
| 13 | 7 |
| 14 | 5 |
| 15 | 4 |
| Total | 1182 |

The variance of the appearance frequencies in Table 10 is 70286.25. Because this exceeds T = 10000, the first and second cyclic blocks are bundled and obtained as one cyclic group, called the first cyclic group (GRP=1). The first cyclic group contains two cyclic blocks (RL=2) and spans the 1st through the 1182nd data. Huffman coding of the first cyclic group gives the result shown in Table 11 below. CV is the symbol value (0 to 15, or 0000 to 1111 in binary), and HUFF is the Huffman code obtained from the minimum-variance Huffman tree. VAR_S is the variance of the appearance frequencies (FRQ) in the cyclic group (GRP=1). DIC_SIZE is the bit length of the Huffman code for each symbol. In Table 11, the expected compressed size, shown as COMPRESSED, totals 1620 bits, and the original size totals 4728 bits. Here the original is the 1st through 1182nd data, which comprise the first and second cyclic blocks.

Table 11

| CV | FRQ | GRP | RL | VAR_S | COMPRESSED | DIC_SIZE | ORIGINAL | CV | HUFF |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1068 | 1 | 2 | 70286.25 | 1068 | 1 | 4272 | 0 | 0 |
| 1 | 8 | 1 | 2 | 70286.25 | 40 | 5 | 32 | 1 | 10100 |
| 2 | 9 | 1 | 2 | 70286.25 | 45 | 5 | 36 | 2 | 10010 |
| 3 | 10 | 1 | 2 | 70286.25 | 50 | 5 | 40 | 3 | 10000 |
| 4 | 10 | 1 | 2 | 70286.25 | 50 | 5 | 40 | 4 | 10001 |
| 5 | 9 | 1 | 2 | 70286.25 | 45 | 5 | 36 | 5 | 10011 |
| 6 | 14 | 1 | 2 | 70286.25 | 56 | 4 | 56 | 6 | 1100 |
| 7 | 11 | 1 | 2 | 70286.25 | 44 | 4 | 44 | 7 | 1110 |
| 8 | 7 | 1 | 2 | 70286.25 | 35 | 5 | 28 | 8 | 11010 |
| 9 | 3 | 1 | 2 | 70286.25 | 18 | 6 | 12 | 9 | 101011 |
| 10 | 5 | 1 | 2 | 70286.25 | 25 | 5 | 20 | 10 | 11110 |
| 11 | 5 | 1 | 2 | 70286.25 | 25 | 5 | 20 | 11 | 11111 |
| 12 | 7 | 1 | 2 | 70286.25 | 35 | 5 | 28 | 12 | 10110 |
| 13 | 7 | 1 | 2 | 70286.25 | 35 | 5 | 28 | 13 | 10111 |
| 14 | 5 | 1 | 2 | 70286.25 | 25 | 5 | 20 | 14 | 11011 |
| 15 | 4 | 1 | 2 | 70286.25 | 24 | 6 | 16 | 15 | 101010 |
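The figures for the first cyclic group can be checked with a few lines of arithmetic. The sketch below is only an illustration: the variable names are made up here, and the use of the sample variance (divisor n - 1) is an inference from the tabulated value, since the description does not state which variance formula was used.

```python
# Per-symbol appearance frequencies of symbols 0..15 (FRQ column above)
FRQ = [1068, 8, 9, 10, 10, 9, 14, 11, 7, 3, 5, 5, 7, 7, 5, 4]
# Huffman code bit length per symbol (DIC_SIZE column above)
DIC_SIZE = [1, 5, 5, 5, 5, 5, 4, 4, 5, 6, 5, 5, 5, 5, 5, 6]
T = 10000      # reference value of this embodiment
N_BITS = 4     # fixed symbol width

n = len(FRQ)
mean = sum(FRQ) / n
# Assumption: sample variance (divisor n - 1), which reproduces VAR_S
var_s = sum((f - mean) ** 2 for f in FRQ) / (n - 1)

# Expected compressed size: frequency times code length, summed over symbols
compressed_bits = sum(f * size for f, size in zip(FRQ, DIC_SIZE))
# Original size: every symbol occupies N_BITS bits
original_bits = sum(FRQ) * N_BITS

print(var_s, var_s >= T, compressed_bits, original_bits)
```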

That is, for the data of the first and second cyclic blocks (the 1st through 1182nd data), each symbol is assigned the bits of its Huffman code according to the coding table of Table 11 to generate the compressed data.

After the first cyclic group has been compressed, processing continues from the next data, the 1183rd. The second cyclic group is formed and compressed in the same way as the first, and compression proceeds group by group. Conceptually, for the second cyclic group the 1183rd data is treated as the 1st data, and the procedure used to create the first cyclic group is applied unchanged.
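The grouping procedure described so far can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names are hypothetical, symbols are assumed to be already extracted as integers, minimum cyclic blocks are used, and the sample variance (divisor n - 1) is assumed as the dispersion indicator.

```python
def sample_variance(freqs):
    """Sample variance (divisor n - 1) of a list of frequencies."""
    n = len(freqs)
    mean = sum(freqs) / n
    return sum((f - mean) ** 2 for f in freqs) / (n - 1)


def split_into_groups(symbols, n_bits=4, threshold=10000):
    """Return (start, end, block_count) per cyclic group; end is exclusive."""
    n_symbols = 1 << n_bits
    groups = []
    group_start = 0
    group_freq = [0] * n_symbols   # frequencies accumulated over the group
    seen = set()                   # symbols seen in the current cyclic block
    blocks = 0
    for pos, sym in enumerate(symbols):
        group_freq[sym] += 1
        seen.add(sym)
        if len(seen) == n_symbols:         # minimum cyclic block complete
            blocks += 1
            seen.clear()
            if sample_variance(group_freq) >= threshold:
                groups.append((group_start, pos + 1, blocks))
                group_start = pos + 1      # next group starts afresh
                group_freq = [0] * n_symbols
                blocks = 0
    if group_start < len(symbols):         # final group: condition not met
        groups.append((group_start, len(symbols), blocks))
    return groups


# Toy run with 1-bit symbols and a small threshold
print(split_into_groups([0, 0, 0, 1, 0, 1, 1], n_bits=1, threshold=2))
```

In the toy run the first block has frequencies [3, 1] (variance 2), so it closes a group at once, and the remaining symbols form the final group.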

Table 12 shows the compression result for the second cyclic group. The second cyclic group consists of 16 cyclic blocks: the variance of the symbol frequencies accumulated from the third through the eighteenth cyclic block is 70582.36, the first value to exceed T = 10000. The group contains 2069 data, the sum of the frequencies (FRQ), and therefore spans the 1183rd through the 3251st data. As a result, 8276 bits of original data are compressed to 5760 bits.

Table 12

| CV | FRQ | GRP | RL | VAR_S | COMPRESSED | DIC_SIZE | ORIGINAL | CV | HUFF |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1125 | 2 | 16 | 70582.36 | 1125 | 1 | 4500 | 0 | 0 |
| 1 | 61 | 2 | 16 | 70582.36 | 305 | 5 | 244 | 1 | 10100 |
| 2 | 85 | 2 | 16 | 70582.36 | 340 | 4 | 340 | 2 | 1111 |
| 3 | 57 | 2 | 16 | 70582.36 | 285 | 5 | 228 | 3 | 11000 |
| 4 | 53 | 2 | 16 | 70582.36 | 265 | 5 | 212 | 4 | 11101 |
| 5 | 57 | 2 | 16 | 70582.36 | 285 | 5 | 228 | 5 | 11001 |
| 6 | 71 | 2 | 16 | 70582.36 | 355 | 5 | 284 | 6 | 10010 |
| 7 | 74 | 2 | 16 | 70582.36 | 370 | 5 | 296 | 7 | 10001 |
| 8 | 59 | 2 | 16 | 70582.36 | 295 | 5 | 236 | 8 | 10111 |
| 9 | 54 | 2 | 16 | 70582.36 | 270 | 5 | 216 | 9 | 11100 |
| 10 | 77 | 2 | 16 | 70582.36 | 385 | 5 | 308 | 10 | 10000 |
| 11 | 63 | 2 | 16 | 70582.36 | 315 | 5 | 252 | 11 | 10011 |
| 12 | 56 | 2 | 16 | 70582.36 | 280 | 5 | 224 | 12 | 11010 |
| 13 | 56 | 2 | 16 | 70582.36 | 280 | 5 | 224 | 13 | 11011 |
| 14 | 60 | 2 | 16 | 70582.36 | 300 | 5 | 240 | 14 | 10110 |
| 15 | 61 | 2 | 16 | 70582.36 | 305 | 5 | 244 | 15 | 10101 |

Regarding the reference value T: computing the variance while accumulating the distributions of the cyclic blocks is economical in terms of speed. Alternatively, Huffman coding can actually be performed each time the distribution of a further cyclic block is accumulated, so that the combination of cyclic blocks with the highest compression gain is found and compressed as one group.

That is, in the description above the first and second cyclic blocks were bundled into the first group because the variance of their combined per-symbol appearance frequencies exceeded T = 10000. Instead, actual Huffman coding can be performed on the first cyclic block, on the first and second cyclic blocks together, on the first through third cyclic blocks together, and so on. The compression ratios obtained for the a-th through b-th cyclic blocks are then compared, and the groups are formed from whichever combination compresses best.
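One way to evaluate such candidates without building the code table is to use the fact that the total length of an optimal Huffman coding equals the sum of the merged node weights. The sketch below assumes that selection criterion (smallest compressed/original ratio) and ignores the dictionary overhead; `huffman_cost` and `best_block_count` are illustrative names, not terms from this description.

```python
import heapq


def huffman_cost(freqs):
    """Total bits of an optimal Huffman coding of the given frequencies."""
    heap = [f for f in freqs if f > 0]
    if len(heap) == 1:
        return heap[0]             # a lone symbol still needs one bit each
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged            # each merge adds one bit to its leaves
        heapq.heappush(heap, merged)
    return total


def best_block_count(block_freqs, n_bits=4):
    """Return (blocks_to_merge, ratio) with the smallest compression ratio."""
    acc = [0] * (1 << n_bits)
    best = None
    for count, freq in enumerate(block_freqs, start=1):
        acc = [a + f for a, f in zip(acc, freq)]
        ratio = huffman_cost(acc) / (sum(acc) * n_bits)
        if best is None or ratio < best[1]:
            best = (count, ratio)
    return best


# Frequencies of the first cyclic group (blocks 1 and 2 combined)
group_1 = [1068, 8, 9, 10, 10, 9, 14, 11, 7, 3, 5, 5, 7, 7, 5, 4]
print(huffman_cost(group_1))
```

For the combined frequencies of the first cyclic group this gives 1620 bits, the same total as the COMPRESSED column of Table 11.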

Alternatively, each time a cyclic block is identified, that single cyclic block may be treated as a cyclic group immediately and compressed by Huffman coding, without considering the variance of the per-symbol appearance frequencies within it.

Returning to the variance-based grouping of cyclic blocks: as compression proceeds group by group, the end of the data is eventually reached. After the last complete group has been formed and compressed, the end of the data is reached with the grouping condition still unmet. This last, incomplete group is called the final group, and Huffman coding is performed on it as is, even though it does not satisfy the condition on the reference value T. In this embodiment the 28th group is the final group, and Table 13 shows its compression result. Because the final group does not satisfy the variance condition, its variance (VAR_S) is meaningless, and the number of cyclic blocks it contains (RL) is of little significance, so both are left blank in Table 13.

Table 13

| CV | FRQ | GRP | RL | VAR_S | COMPRESSED | DIC_SIZE | ORIGINAL | CV | HUFF |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 219 | 28 | | | 219 | 1 | 876 | 0 | 1 |
| 1 | 25 | 28 | | | 100 | 4 | 100 | 1 | 0101 |
| 2 | 29 | 28 | | | 116 | 4 | 116 | 2 | 0100 |
| 3 | 11 | 28 | | | 55 | 5 | 44 | 3 | 01110 |
| 4 | 17 | 28 | | | 85 | 5 | 68 | 4 | 00011 |
| 5 | 13 | 28 | | | 65 | 5 | 52 | 5 | 01100 |
| 6 | 41 | 28 | | | 164 | 4 | 164 | 6 | 0000 |
| 7 | 35 | 28 | | | 140 | 4 | 140 | 7 | 0010 |
| 8 | 11 | 28 | | | 55 | 5 | 44 | 8 | 01111 |
| 9 | 10 | 28 | | | 60 | 6 | 40 | 9 | 000100 |
| 10 | 5 | 28 | | | 30 | 6 | 20 | 10 | 011011 |
| 11 | 6 | 28 | | | 36 | 6 | 24 | 11 | 011010 |
| 12 | 9 | 28 | | | 54 | 6 | 36 | 12 | 001110 |
| 13 | 9 | 28 | | | 54 | 6 | 36 | 13 | 000101 |
| 14 | 7 | 28 | | | 42 | 6 | 28 | 14 | 001111 |
| 15 | 16 | 28 | | | 80 | 5 | 64 | 15 | 00110 |

Table 14 summarizes the compression results for all groups of this embodiment. Groups 3 through 26 are omitted for reasons of space.

By its nature, the final cyclic group may have to be Huffman-coded before every symbol has appeared at least once. In that case there is no Huffman code for a symbol that has not appeared by the end of the binary data. For the final cyclic group, a bit length (DIC_SIZE) of 0 is therefore recorded for each absent symbol, so that no Huffman code is assigned to that symbol when the Huffman code string is split into per-symbol codes.

Table 14

| CV | FRQ | GRP | RL | VAR_S | COMPRESSED | DIC_SIZE | ORIGINAL | CV | HUFF |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1068 | 1 | 2 | 70286.3 | 1068 | 1 | 4272 | 0 | 0 |
| 1 | 8 | 1 | 2 | 70286.3 | 40 | 5 | 32 | 1 | 10100 |
| 2 | 9 | 1 | 2 | 70286.3 | 45 | 5 | 36 | 2 | 10010 |
| 3 | 10 | 1 | 2 | 70286.3 | 50 | 5 | 40 | 3 | 10000 |
| 4 | 10 | 1 | 2 | 70286.3 | 50 | 5 | 40 | 4 | 10001 |
| 5 | 9 | 1 | 2 | 70286.3 | 45 | 5 | 36 | 5 | 10011 |
| 6 | 14 | 1 | 2 | 70286.3 | 56 | 4 | 56 | 6 | 1100 |
| 7 | 11 | 1 | 2 | 70286.3 | 44 | 4 | 44 | 7 | 1110 |
| 8 | 7 | 1 | 2 | 70286.3 | 35 | 5 | 28 | 8 | 11010 |
| 9 | 3 | 1 | 2 | 70286.3 | 18 | 6 | 12 | 9 | 101011 |
| 10 | 5 | 1 | 2 | 70286.3 | 25 | 5 | 20 | 10 | 11110 |
| 11 | 5 | 1 | 2 | 70286.3 | 25 | 5 | 20 | 11 | 11111 |
| 12 | 7 | 1 | 2 | 70286.3 | 35 | 5 | 28 | 12 | 10110 |
| 13 | 7 | 1 | 2 | 70286.3 | 35 | 5 | 28 | 13 | 10111 |
| 14 | 5 | 1 | 2 | 70286.3 | 25 | 5 | 20 | 14 | 11011 |
| 15 | 4 | 1 | 2 | 70286.3 | 24 | 6 | 16 | 15 | 101010 |
| 0 | 1125 | 2 | 16 | 70582.4 | 1125 | 1 | 4500 | 0 | 0 |
| 1 | 61 | 2 | 16 | 70582.4 | 305 | 5 | 244 | 1 | 10100 |
| 2 | 85 | 2 | 16 | 70582.4 | 340 | 4 | 340 | 2 | 1111 |
| 3 | 57 | 2 | 16 | 70582.4 | 285 | 5 | 228 | 3 | 11000 |
| 4 | 53 | 2 | 16 | 70582.4 | 265 | 5 | 212 | 4 | 11101 |
| 5 | 57 | 2 | 16 | 70582.4 | 285 | 5 | 228 | 5 | 11001 |
| 6 | 71 | 2 | 16 | 70582.4 | 355 | 5 | 284 | 6 | 10010 |
| 7 | 74 | 2 | 16 | 70582.4 | 370 | 5 | 296 | 7 | 10001 |
| 8 | 59 | 2 | 16 | 70582.4 | 295 | 5 | 236 | 8 | 10111 |
| 9 | 54 | 2 | 16 | 70582.4 | 270 | 5 | 216 | 9 | 11100 |
| 10 | 77 | 2 | 16 | 70582.4 | 385 | 5 | 308 | 10 | 10000 |
| 11 | 63 | 2 | 16 | 70582.4 | 315 | 5 | 252 | 11 | 10011 |
| 12 | 56 | 2 | 16 | 70582.4 | 280 | 5 | 224 | 12 | 11010 |
| 13 | 56 | 2 | 16 | 70582.4 | 280 | 5 | 224 | 13 | 11011 |
| 14 | 60 | 2 | 16 | 70582.4 | 300 | 5 | 240 | 14 | 10110 |
| 15 | 61 | 2 | 16 | 70582.4 | 305 | 5 | 244 | 15 | 10101 |
| 0 | 477 | 27 | 8 | 12572.7 | 477 | 1 | 1908 | 0 | 1 |
| 1 | 55 | 27 | 8 | 12572.7 | 220 | 4 | 220 | 1 | 0110 |
| 2 | 65 | 27 | 8 | 12572.7 | 260 | 4 | 260 | 2 | 0100 |
| 3 | 26 | 27 | 8 | 12572.7 | 156 | 6 | 104 | 3 | 000010 |
| 4 | 54 | 27 | 8 | 12572.7 | 216 | 4 | 216 | 4 | 0111 |
| 5 | 51 | 27 | 8 | 12572.7 | 255 | 5 | 204 | 5 | 00000 |
| 6 | 100 | 27 | 8 | 12572.7 | 400 | 4 | 400 | 6 | 0001 |
| 7 | 80 | 27 | 8 | 12572.7 | 320 | 4 | 320 | 7 | 0011 |
| 8 | 23 | 27 | 8 | 12572.7 | 138 | 6 | 92 | 8 | 001010 |
| 9 | 23 | 27 | 8 | 12572.7 | 138 | 6 | 92 | 9 | 001001 |
| 10 | 12 | 27 | 8 | 12572.7 | 72 | 6 | 48 | 10 | 010111 |
| 11 | 16 | 27 | 8 | 12572.7 | 96 | 6 | 64 | 11 | 010110 |
| 12 | 24 | 27 | 8 | 12572.7 | 144 | 6 | 96 | 12 | 001000 |
| 13 | 25 | 27 | 8 | 12572.7 | 150 | 6 | 100 | 13 | 000011 |
| 14 | 17 | 27 | 8 | 12572.7 | 102 | 6 | 68 | 14 | 001011 |
| 15 | 30 | 27 | 8 | 12572.7 | 150 | 5 | 120 | 15 | 01010 |
| 0 | 219 | 28 | | | 219 | 1 | 876 | 0 | 1 |
| 1 | 25 | 28 | | | 100 | 4 | 100 | 1 | 0101 |
| 2 | 29 | 28 | | | 116 | 4 | 116 | 2 | 0100 |
| 3 | 11 | 28 | | | 55 | 5 | 44 | 3 | 01110 |
| 4 | 17 | 28 | | | 85 | 5 | 68 | 4 | 00011 |
| 5 | 13 | 28 | | | 65 | 5 | 52 | 5 | 01100 |
| 6 | 41 | 28 | | | 164 | 4 | 164 | 6 | 0000 |
| 7 | 35 | 28 | | | 140 | 4 | 140 | 7 | 0010 |
| 8 | 11 | 28 | | | 55 | 5 | 44 | 8 | 01111 |
| 9 | 10 | 28 | | | 60 | 6 | 40 | 9 | 000100 |
| 10 | 5 | 28 | | | 30 | 6 | 20 | 10 | 011011 |
| 11 | 6 | 28 | | | 36 | 6 | 24 | 11 | 011010 |
| 12 | 9 | 28 | | | 54 | 6 | 36 | 12 | 001110 |
| 13 | 9 | 28 | | | 54 | 6 | 36 | 13 | 000101 |
| 14 | 7 | 28 | | | 42 | 6 | 28 | 14 | 001111 |
| 15 | 16 | 28 | | | 80 | 5 | 64 | 15 | 00110 |

As a result of the compression described above, the original data of 876,040 bits was compressed to 857,929 bits. The following describes how the coding dictionary (dictionary information) needed for restoration is constructed.

For Huffman compression there are broadly two ways to construct the coding dictionary. In the first method, when the cyclic group creation condition is met during compression, the symbol-to-appearance-frequency information of that cyclic group is stored in the coding dictionary. On decompression, the restoring side (decompression side) dynamically rebuilds the Huffman tree that generated the per-symbol Huffman codes at compression time, and uses it to restore (decompress) the Huffman codes in the compressed data to symbols. In the second method, because dynamically rebuilding the Huffman tree on the restoring side can be a disadvantage in terms of speed, the symbol-to-Huffman-code data of each cyclic group is stored in the coding dictionary as a Huffman code string at compression time. The bit length (DIC_SIZE) of each Huffman code is stored with it so that the string can be split into the individual per-symbol Huffman codes, and on restoration the Huffman codes in the compressed data are separated and converted back to symbols.

In the former case, it suffices to store only the appearance frequencies, denoted FRQ and listed in symbol order for each cyclic group, as in Tables 15 and 16 below. The 16 symbol-to-appearance-frequency entries for all possible symbols 0 to 15 are enough to rebuild the Huffman tree used to decompress one cyclic group. The appearance frequency values in the table below may of course themselves be recompressed with a separate compression method.

Table 15

| CV | FRQ |
|---|---|
| 0 | 1068 |
| 1 | 8 |
| 2 | 9 |
| 3 | 10 |
| 4 | 10 |
| 5 | 9 |
| 6 | 14 |
| 7 | 11 |
| 8 | 7 |
| 9 | 3 |
| 10 | 5 |
| 11 | 5 |
| 12 | 7 |
| 13 | 7 |
| 14 | 5 |
| 15 | 4 |
| 0 | 1125 |
| 1 | 61 |
| 2 | 85 |
| 3 | 57 |
| 4 | 53 |
| 5 | 57 |
| 6 | 71 |
| 7 | 74 |
| 8 | 59 |
| 9 | 54 |
| 10 | 77 |
| 11 | 63 |
| 12 | 56 |
| 13 | 56 |
| 14 | 60 |
| 15 | 61 |
| ... | ... |
| 0 | 477 |
| 1 | 55 |
| 2 | 65 |
| 3 | 26 |
| 4 | 54 |
| 5 | 51 |
| 6 | 100 |
| 7 | 80 |
| 8 | 23 |
| 9 | 23 |
| 10 | 12 |
| 11 | 16 |
| 12 | 24 |
| 13 | 25 |
| 14 | 17 |
| 15 | 30 |
| 0 | 219 |
| 1 | 25 |
| 2 | 29 |
| 3 | 11 |
| 4 | 17 |
| 5 | 13 |
| 6 | 41 |
| 7 | 35 |
| 8 | 11 |
| 9 | 10 |
| 10 | 5 |
| 11 | 6 |
| 12 | 9 |
| 13 | 9 |
| 14 | 7 |
| 15 | 16 |

That is, as shown in FIG. 9, the symbol-to-appearance-frequency information can be recovered in units of 16 from the sequentially stored appearance frequencies alone. A Huffman tree is then built from it with the same Huffman coding method used at compression time, and the Huffman code of each symbol is obtained as in Table 16.

Table 16

| Symbol | Huffman code |
|---|---|
| 0 | 0 |
| 1 | 10100 |
| 2 | 10010 |
| 3 | 10000 |
| 4 | 10001 |
| 5 | 10011 |
| 6 | 1100 |
| 7 | 1110 |
| 8 | 11010 |
| 9 | 101011 |
| 10 | 11110 |
| 11 | 11111 |
| 12 | 10110 |
| 13 | 10111 |
| 14 | 11011 |
| 15 | 101010 |

Using the Huffman codes obtained in this way, the Huffman codes in the compressed data can be separated and restored (decompressed) to symbols.
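The first dictionary method relies on the compressor and the decompressor deriving the same code table from the same stored frequencies. The sketch below illustrates that idea with a deterministic heap-based Huffman construction. It is an assumption-laden example: `build_codes` and its tie-breaking rule are hypothetical, and because the embodiment uses a minimum-variance Huffman tree whose tie-breaking is not specified here, the bit patterns produced may differ from Table 16 even though the total coded length is the same.

```python
import heapq

# Stored frequencies of the first cyclic group (first 16 FRQ entries)
FRQ_GROUP_1 = [1068, 8, 9, 10, 10, 9, 14, 11, 7, 3, 5, 5, 7, 7, 5, 4]


def build_codes(freqs):
    """Deterministic Huffman table {symbol: bit string} from frequencies."""
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code})
    heap = [(f, s, {s: ""}) for s, f in enumerate(freqs) if f > 0]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][1]: "0"}
    tie = len(freqs)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1                   # keeps the ordering fully deterministic
    return heap[0][2]


# Both sides run the same routine on the same stored frequencies
encoder_table = build_codes(FRQ_GROUP_1)
decoder_table = build_codes(list(FRQ_GROUP_1))
total_bits = sum(f * len(encoder_table[s]) for s, f in enumerate(FRQ_GROUP_1))
print(encoder_table == decoder_table, total_bits)
```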

The second method of constructing the coding dictionary is described next.

For the coding dictionary (dictionary information) used in restoration (decompression), the Huffman codes (HUFF) of Table 14 are concatenated in order into a single string and stored. Taking the Huffman codes of group 1 in Table 14 as an example, shown in Table 17, the stored string is "010100100101000010001...101010".

Table 17

| HUFF |
|---|
| 0 |
| 10100 |
| 10010 |
| 10000 |
| 10001 |
| 10011 |
| 1100 |
| 1110 |
| 11010 |
| 101011 |
| 11110 |
| 11111 |
| 10110 |
| 10111 |
| 11011 |
| 101010 |

To divide this concatenated Huffman code string correctly and in order into the codes of symbols 0 to 15 within each cyclic group, the length of the Huffman code for each symbol, that is, the bit length (DIC_SIZE) of each Huffman code in Table 14, must also be stored in order. Rather than storing these values (1 through 7) directly, it can be effective to Huffman-code the DIC_SIZE values of Table 14 separately and store them in compressed form. To that end, Table 18 shows the distribution of the DIC_SIZE values of Table 14, and Table 19 shows, for each bit length (DIC_SIZE), its appearance frequency, its Huffman code, and the resulting compressed size.

Table 18

| DIC_SIZE | Appearance frequency |
|---|---|
| 1 | 7 |
| 2 | 4 |
| 3 | 15 |
| 4 | 284 |
| 5 | 112 |
| 6 | 22 |
| 7 | 4 |
| Total | 448 |

Table 19

| DIC_SIZE | Appearance frequency | Huffman code | Compressed size |
|---|---|---|---|
| 1 | 7 | 00110 | 35 |
| 2 | 4 | 001110 | 24 |
| 3 | 15 | 0010 | 60 |
| 4 | 284 | 1 | 284 |
| 5 | 112 | 01 | 224 |
| 6 | 22 | 000 | 66 |
| 7 | 4 | 001111 | 24 |
| Total | 448 | | 717 |

Huffman-coding the DIC_SIZE values as in Table 19 represents them compactly in 717 bits of compressed data. Without compression, each DIC_SIZE value would need a 3-bit binary number to express 1 through 7, and 3 bits multiplied by the 448 occurrences would require 1344 bits.
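The two totals follow directly from the frequencies and code lengths of Table 19, as the short check below shows (the dictionary layout and variable names are illustrative only).

```python
# DIC_SIZE value -> (appearance frequency, Huffman code), from Table 19
TABLE_19 = {
    1: (7, "00110"),
    2: (4, "001110"),
    3: (15, "0010"),
    4: (284, "1"),
    5: (112, "01"),
    6: (22, "000"),
    7: (4, "001111"),
}

entries = sum(freq for freq, _ in TABLE_19.values())
huffman_bits = sum(freq * len(code) for freq, code in TABLE_19.values())
fixed_bits = entries * 3   # 3 bits are enough for the values 1..7

print(entries, huffman_bits, fixed_bits)
```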

With the DIC_SIZE information, the Huffman code string, in which the Huffman codes mapped in order to the symbols of each cyclic group in Table 14 are concatenated, can be split into Huffman codes and assigned 16 at a time to the symbols of each cyclic group. In this way every symbol of every cyclic group is eventually matched to its Huffman code for restoration (decompression). As noted above, the final cyclic group may have to be Huffman-coded before every symbol has appeared at least once. If a DIC_SIZE of 0 was recorded for such a symbol in the final cyclic group, no Huffman code is assigned to that symbol when the Huffman code string is split.

For the Huffman code string of group 1 in Table 17, "010100100101000010001...101010", the Huffman-compressed bit-length (DIC_SIZE) data for the symbols of group 1, 001100101010101110100001..., is decoded with the coding table of Table 19. This gives the DIC_SIZE values 1, 5, 5, 5, 5, 5, 4, 4, 5, 6, 5, .... Splitting the Huffman code string into pieces of these bit lengths restores the Huffman codes of Table 17 exactly. With 4-bit fixed-length symbols, every 16 Huffman codes form the symbol-to-Huffman-code table of one cyclic group, and the codes map to symbols 0 to 15 in order. In general, knowing the fixed symbol width tells how many consecutive Huffman codes belong to one cyclic group.
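The two decoding steps described here, recovering the DIC_SIZE values with the Table 19 code and then cutting the Huffman code string at those lengths, can be sketched as below. The helper names `decode_prefix` and `split_codes` are hypothetical, bit strings are modeled as Python strings of "0" and "1" for readability, and a DIC_SIZE of 0 is treated as "symbol absent" as described for the final cyclic group.

```python
# Table 19 read as a prefix code: Huffman code -> DIC_SIZE value
DIC_SIZE_CODE = {"00110": 1, "001110": 2, "0010": 3, "1": 4,
                 "01": 5, "000": 6, "001111": 7}

# Huffman codes of group 1 in symbol order 0..15 (Table 17)
TABLE_17 = ["0", "10100", "10010", "10000", "10001", "10011", "1100",
            "1110", "11010", "101011", "11110", "11111", "10110",
            "10111", "11011", "101010"]


def decode_prefix(bits, table, count):
    """Decode `count` values from `bits`; return (values, bits consumed)."""
    values, current, used = [], "", 0
    for bit in bits:
        if len(values) == count:
            break
        current += bit
        used += 1
        if current in table:       # prefix code: first match is the code
            values.append(table[current])
            current = ""
    return values, used


def split_codes(code_string, sizes):
    """Cut the code string into per-symbol codes; size 0 means no code."""
    codes, pos = {}, 0
    for symbol, size in enumerate(sizes):
        if size:
            codes[symbol] = code_string[pos:pos + size]
            pos += size
    return codes, pos


# Compressed DIC_SIZE data of group 1, rebuilt here from Table 19
encode_size = {size: code for code, size in DIC_SIZE_CODE.items()}
size_bits = "".join(encode_size[len(code)] for code in TABLE_17)

sizes, _ = decode_prefix(size_bits, DIC_SIZE_CODE, 16)
codes, _ = split_codes("".join(TABLE_17), sizes)
print(sizes)
print(codes)
```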

In summary, the coding dictionary (dictionary information) for restoration (decompression) requires the Huffman code string, formed by concatenating the Huffman codes of each cell of Table 16, and the coding table for DIC_SIZE.

For reference, restoration (decompression) does not require the information (RL) on how many cyclic blocks each group contains. While decoding, if the variance of the appearance frequencies of the symbols in the restored cyclic blocks is below the reference dispersion (T = 10000 above), the decoder simply keeps accumulating the symbol frequencies through the next cyclic block. Once the variance reaches the reference dispersion, the symbols collected up to that point are separated off as a cyclic group.

The restoration method (decompression method) is as follows.

FIG. 10 conceptually shows how, during data restoration, Huffman codes are separated from the Huffman code string to restore the per-symbol Huffman coding dictionary of each cyclic group. FIG. 12 shows the process of restoring the original symbols from the actual compressed data using the per-group, per-symbol Huffman coding dictionaries restored in FIG. 10. Using the Huffman code string and the DIC_SIZE information stored with the Huffman-compressed data of the original binary data, the Huffman code of each symbol 0 to 15 for each group is recovered as follows. Each run of 16 Huffman codes gives, in order, the Huffman codes of symbols 0 to 15 used to decompress one group.

Referring to FIG. 10, separating 16 Huffman codes from the Huffman code string using the decompressed DIC_SIZE data yields Table 20 below. Table 20 is the per-symbol Huffman coding dictionary needed to decode the first cyclic group. Decoding now proceeds from the most significant bit of the compressed data toward the lower bits, converting each Huffman code to its assigned symbol.

One may ask how, in the continuous run of Huffman codes in the compressed data, the last Huffman code of the first cyclic group is cleanly separated from the first Huffman code of the second cyclic group. Let H1 be the last Huffman code of the first cyclic group. The first bit of the second cyclic group's first Huffman code is necessarily 0 or 1. Suppose a code beginning with H1 + '0' or H1 + '1' were a Huffman code used to decode some other symbol of the first cyclic group. That would violate the prefix-code property of Huffman codes, namely that no Huffman code is a prefix of another, so no such code can exist. In other words, within the first cyclic group there is no code of the form H1+'0....' or H1+'1....' that contains H1 as a prefix. The last Huffman code of the first cyclic group is therefore uniquely identified as H1. Decoding it yields the last symbol, which completes the last cyclic block of the first cyclic group because every possible symbol has now appeared at least once in that block. The symbols of successive cyclic groups thus separate naturally in the compressed data.

Table 20

| GRP | Symbol | HUFF |
|---|---|---|
| 1 | 0 | 0 |
| 1 | 1 | 10100 |
| 1 | 2 | 10010 |
| 1 | 3 | 10000 |
| 1 | 4 | 10001 |
| 1 | 5 | 10011 |
| 1 | 6 | 1100 |
| 1 | 7 | 1110 |
| 1 | 8 | 11010 |
| 1 | 9 | 101011 |
| 1 | 10 | 11110 |
| 1 | 11 | 11111 |
| 1 | 12 | 10110 |
| 1 | 13 | 10111 |
| 1 | 14 | 11011 |
| 1 | 15 | 101010 |

Regarding the cyclic block method with 4-bit fixed-length extraction: because this embodiment formed minimum cyclic blocks at compression time, the decoder tracks the frequency distribution of the decoded symbols as it decodes. When every symbol has been encountered at least once for the first time, the dispersion is computed from the appearance frequencies of the symbols collected so far. If the value is at or above the reference dispersion, the cyclic block is separated off as a cyclic group. If not, the cyclic block is not separated and symbol collection continues. In the latter case, although the cyclic block has not become an independent cyclic group, the per-symbol counts used to detect the block must all be reset to 0 before the appearance frequencies are counted for the next cyclic block, so that the next cyclic block can be identified.

Once a new cyclic block has been identified in this way, that is, once every symbol has again been encountered at least once, the dispersion is computed from the appearance frequencies of the symbols in the two cyclic blocks combined. Only if this value is at or above the reference dispersion are the two cyclic blocks separated off as a cyclic group. In other words, each time a cyclic block is identified during decoding, the dispersion of all symbols in the cyclic blocks collected so far is computed, and whether that value reaches the reference dispersion is itself the criterion for separating cyclic groups.
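The decoder-side procedure of the last few paragraphs can be put together as a single loop: decode with the current group's table, detect each minimum cyclic block, test the accumulated frequencies against T, and switch to the next group's table when a group closes. The sketch below is illustrative only. `decode_stream` is a hypothetical name, bit strings are modeled as Python strings, the per-group code tables are assumed to be already restored from the dictionary, and the sample variance (divisor n - 1) is assumed as the dispersion measure.

```python
def sample_variance(freqs):
    """Sample variance (divisor n - 1) of a list of frequencies."""
    n = len(freqs)
    mean = sum(freqs) / n
    return sum((f - mean) ** 2 for f in freqs) / (n - 1)


def decode_stream(bits, group_tables, n_bits=4, threshold=10000):
    """Decode `bits` using one {symbol: code} table per cyclic group."""
    n_symbols = 1 << n_bits
    tables = iter(group_tables)
    inverse = {code: s for s, code in next(tables).items()}
    out, current = [], ""
    group_freq = [0] * n_symbols   # accumulated over the current group
    seen = set()                   # reset for every cyclic block
    for bit in bits:
        current += bit
        if current not in inverse:
            continue               # code not complete yet
        symbol = inverse[current]
        current = ""
        out.append(symbol)
        group_freq[symbol] += 1
        seen.add(symbol)
        if len(seen) == n_symbols:             # a cyclic block just closed
            seen.clear()
            if sample_variance(group_freq) >= threshold:
                group_freq = [0] * n_symbols   # the group closes as well
                following = next(tables, None)
                if following is not None:      # switch to the next table
                    inverse = {code: s for s, code in following.items()}
    return out


# Toy example: 1-bit symbols, two groups with different code tables
tables = [{0: "0", 1: "1"}, {0: "1", 1: "0"}]
print(decode_stream("0001100", tables, n_bits=1, threshold=2))
```

In the toy example the first four bits decode to 0, 0, 0, 1 with the first table. The frequencies [3, 1] reach the threshold, so the remaining bits are decoded with the second table.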

If, on the other hand, the maximum cyclic block scheme had been applied, then during decoding, while the frequency distribution of the decoded symbols is tracked, the decoder checks, at the point where every symbol has been encountered at least once for the first time, whether the symbol at that position (the reference symbol) is immediately followed by further occurrences of the same symbol. If so, the cyclic block is recognized as extending through all of those repeated symbols, as a cyclic block within one cyclic group. The process of calculating the dispersion over such cyclic blocks to obtain the cyclic groups is the same as in the minimum cyclic block scheme.
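The maximum cyclic block rule can be illustrated with a short sketch. The function name `maximum_cyclic_blocks` and the list-based representation of the symbol stream are assumptions made for the example; only the extension rule itself (continue through the run of the reference symbol) is taken from the text.

```python
def maximum_cyclic_blocks(symbols, n_bits):
    """Delimit maximum cyclic blocks in a list of n-bit symbols.

    A block first reaches the point where all 2**n_bits symbols have been
    seen. The symbol at that point is the reference symbol, and the block is
    extended through any immediately following repeats of it.
    """
    alphabet = 1 << n_bits
    blocks = []
    start = 0          # index where the current block begins
    seen = set()       # symbols seen so far in the current block
    i = 0
    while i < len(symbols):
        seen.add(symbols[i])
        if len(seen) == alphabet:
            reference_symbol = symbols[i]
            # Extend the block through the run of the reference symbol.
            while i + 1 < len(symbols) and symbols[i + 1] == reference_symbol:
                i += 1
            blocks.append(symbols[start:i + 1])
            start = i + 1
            seen = set()
        i += 1
    return blocks, symbols[start:]   # second value: symbols left over
```

For 1-bit symbols, `[0, 1, 1, 1, 0, 0, 1]` yields the blocks `[0, 1, 1, 1]` and `[0, 0, 1]`: the first block would end at index 1 under the minimum scheme, but here it absorbs the two following repeats of the reference symbol 1.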

This embodiment uses the minimum cyclic block scheme, and for cyclic group 1 the decoding, using the table of Table 20, of the symbols belonging to the first and second cyclic blocks is thereby complete.

Next, using the Huffman code string and the DIC_SIZE information in the Huffman-compressed data for the original binary data, the 17th to 32nd Huffman codes, which apply to symbols 0 to 15 of the second cyclic group, are decoded to build the Huffman decoding table for the second cyclic group, shown in Table 21. Decoding then resumes in the compressed data at the data immediately following the portion consumed by the first group, moving toward the lower-order bits and decoding the symbol assigned to each Huffman code below. The procedure is the same as for the first cyclic group, except that the data is decoded as second-cyclic-group data using the table of Table 21.

GRP  Symbol  HUFF
2    0       0
2    1       10100
2    2       1111
2    3       11000
2    4       11101
2    5       11001
2    6       10010
2    7       10001
2    8       10111
2    9       11100
2    10      10000
2    11      10011
2    12      11010
2    13      11011
2    14      10110
2    15      10101
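Decoding with the table above amounts to reading bits until they match one of the prefix-free codes. The sketch below uses the code values of Table 21 as listed; the function name `decode_prefix`, the use of a character string of "0" and "1" for the bit stream, and the sample bit string are assumptions made for illustration.

```python
# Huffman codes for cyclic group 2, taken from Table 21 (code -> symbol).
TABLE_21 = {
    "0": 0, "10100": 1, "1111": 2, "11000": 3, "11101": 4, "11001": 5,
    "10010": 6, "10001": 7, "10111": 8, "11100": 9, "10000": 10,
    "10011": 11, "11010": 12, "11011": 13, "10110": 14, "10101": 15,
}


def decode_prefix(bits, table):
    """Decode a string of '0'/'1' characters using a prefix-free code table."""
    symbols = []
    code = ""
    for bit in bits:
        code += bit
        if code in table:          # a complete code has been read
            symbols.append(table[code])
            code = ""
    if code:
        raise ValueError("trailing bits do not form a complete code")
    return symbols


# Hypothetical bit string: codes for symbols 0, 2, 1, 0 concatenated.
sample = "0" + "1111" + "10100" + "0"
decoded = decode_prefix(sample, TABLE_21)
```

Because no code in the table is a prefix of another, the first match while reading bit by bit is always the correct one, so no separator between codes is needed.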

The above process is repeated in this way. For the final group, decompression proceeds with reference to that group's Huffman table until the end of the compressed data is reached, at which point the decompression process is complete.

To show the effect of the present invention more quantitatively: from Table 14, the final compressed data comes to approximately 858,889 bits, comprising 857,929 bits of Huffman-compressed symbol data, 717 bits of compressed DIC_SIZE data, the DIC_SIZE decompression table (the per-DIC_SIZE Huffman code data of Table 19), and related data. By contrast, as Table 2 shows, Huffman coding of the same data using a minimum-variance Huffman tree compresses it only to 875,811 bits, and even that figure excludes the Huffman coding dictionary (Huffman dictionary table).

According to this embodiment, therefore, an additional compression of about 17,000 bits is obtained (875,811 - 858,889 = 16,922 bits), and the method remains usable even for data distributions in which the occurrence frequencies are nearly uniform and Huffman coding is therefore inefficient.

In this embodiment, 16 symbols (0 to 15) obtained by dividing the data into 4-bit units were used. More generally, with the data divided into n-bit units, giving 2^n symbols from 0 to 2^n - 1, the user can set the reference value (T) according to the overall size and variance of the data so as to obtain the best compression ratio.

As an example, when the data is read in 8-bit units, 256 symbols, from 0 to 255, can occur.
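Fixed-width symbol extraction for an arbitrary unit size n can be sketched as follows. The function name `symbols_from_bytes` and the most-significant-bit-first reading order are assumptions made for the example; the text also allows scanning in the opposite direction.

```python
def symbols_from_bytes(data, n_bits):
    """Split a byte string into n-bit symbols, most significant bits first.

    Trailing bits that do not fill a whole symbol are dropped (an assumption;
    the handling of such a remainder is not specified here).
    """
    mask = (1 << n_bits) - 1
    acc = 0          # bit accumulator
    acc_bits = 0     # number of valid bits currently in `acc`
    out = []
    for byte in data:
        acc = (acc << 8) | byte
        acc_bits += 8
        while acc_bits >= n_bits:
            acc_bits -= n_bits
            out.append((acc >> acc_bits) & mask)
        acc &= (1 << acc_bits) - 1   # keep only the unconsumed low bits
    return out


four_bit = symbols_from_bytes(b"\xA5\x0F", 4)    # 16 possible symbols
eight_bit = symbols_from_bytes(b"\xA5\x0F", 8)   # 256 possible symbols
```

The same two bytes give four 4-bit symbols or two 8-bit symbols, so doubling the unit size halves the symbol count while squaring the alphabet size from 16 to 256.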

Reading the 5,682,432-bit original binary data in 8-bit units gives 710,304 bytes. The occurrence frequency of each symbol is shown in the table below.

Symbol  Occurrence frequency
0       7465
1       3878
2       3455
3       4163
4       2237
5       2696
6       3644
7       2883
8       2029
9       2193
10      2383
11      2475
12      3067
13      2506
14      12051
15      2262
16      1638
17      2112
18      2429
…       …
240     3181
241     3418
242     1180
243     1797
244     1722
245     1394
246     1966
247     1421
248     3151
249     1392
250     1756
251     1721
252     2103
253     1650
254     1740
255     1779

The 710,304 symbols are scanned from the most significant end to the least significant end (or vice versa). Each time all 256 symbols have been encountered at least once, a block is delimited using either the minimum cyclic block scheme or the maximum cyclic block scheme described above. Whenever a new block is delimited, the occurrence frequency of each of the 256 symbols is calculated over the current cyclic group including the new block, and the variance of those frequencies is then calculated. If the variance does not exceed a specified reference value (set to 10,000 in this embodiment), the new cyclic blocks are merged into the cyclic group, which continues to grow. Once the variance exceeds the threshold, Huffman coding is performed on that cyclic group. This process is repeated. It differs from the description above only in whether 4-bit or 8-bit fixed bit strings are extracted; everything else is the same.

For the 8-bit fixed bit string case, however, Table 23 shows an example in which compression was performed with the cyclic blocks delimited by the maximum cyclic block scheme.

Applying the method of this embodiment to the full 5,682,432 bits compressed the data to 5,320,225 bits (including the coding dictionary), whereas ordinary single-pass Huffman coding compressed it to 5,544,278 bits (excluding the coding dictionary), which shows an improvement in the compression ratio.
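The figures above can be turned into compression ratios with a few lines of arithmetic. The variable names are chosen for this example only; the three bit counts are the ones stated in the text. Note that the comparison slightly understates the gain, since the single-pass figure excludes its coding dictionary while the figure for this embodiment includes it.

```python
ORIGINAL_BITS = 5_682_432      # original binary data
PROPOSED_BITS = 5_320_225      # this embodiment, coding dictionary included
SINGLE_PASS_BITS = 5_544_278   # single-pass Huffman, dictionary excluded

# Bits saved relative to single-pass Huffman coding.
saved_bits = SINGLE_PASS_BITS - PROPOSED_BITS

# Compressed size as a fraction of the original size (smaller is better).
ratio_proposed = PROPOSED_BITS / ORIGINAL_BITS
ratio_single_pass = SINGLE_PASS_BITS / ORIGINAL_BITS

print(f"saved: {saved_bits} bits")
print(f"proposed: {ratio_proposed:.1%}, single-pass: {ratio_single_pass:.1%}")
```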

As Table 23 below shows, 25 cyclic groups (GRP) were generated. The last one, cyclic group 25, reached the end of the data before the variance within the group exceeded the reference value, so its Huffman codes were generated for the group as a whole without regard to cyclic block boundaries. For the first cyclic group, the variance of the per-symbol occurrence frequencies exceeded the reference value (10,000 in this embodiment) once 12 cyclic blocks had been combined, so those 12 cyclic blocks form the first cyclic group, and the table shows the result of Huffman coding that group.

CV   FRQ   GRP  RL  VAR_S     COMPRESSED  DIC_SIZE  ORIGINAL  HUFF
0    1479  1    12  16946.68  5916        4         11832     1110
1    96    1    12  16946.68  768         8         768       11000000
2    79    1    12  16946.68  711         9         632       001001000
3    112   1    12  16946.68  896         8         896       10001000
4    88    1    12  16946.68  792         9         704       000001000
5    85    1    12  16946.68  765         9         680       000101000
6    118   1    12  16946.68  944         8         944       01111110
7    93    1    12  16946.68  744         8         744       11010110
8    82    1    12  16946.68  738         9         656       000111000
9    79    1    12  16946.68  711         9         632       001001001
10   80    1    12  16946.68  720         9         640       001000100
11   87    1    12  16946.68  783         9         696       000011010
…    …     …    …   …         …           …         …         …
250  69    1    12  16946.68  621         9         552       011000101
251  75    1    12  16946.68  675         9         600       001011111
252  115   1    12  16946.68  920         8         920       10000101
253  88    1    12  16946.68  792         9         704       000000011
254  101   1    12  16946.68  808         8         808       10101111
255  150   1    12  16946.68  1200        8         1200      00110100
…    …     …    …   …         …           …         …         …
0    1780  25   -   -         8900        5         14240     00001
1    176   25   -   -         1408        8         1408      01001010
2    295   25   -   -         2065        7         2360      1000001
3    151   25   -   -         1208        8         1208      01111010
4    95    25   -   -         855         9         760       001110000
5    283   25   -   -         1981        7         2264      1001101
6    169   25   -   -         1352        8         1352      01010010
7    105   25   -   -         945         9         840       001000110
8    201   25   -   -         1608        8         1608      00101100
9    186   25   -   -         1488        8         1488      00111111
10   100   25   -   -         900         9         800       001011110
…    …     …    …   …         …           …         …         …
250  144   25   -   -         1152        8         1152      10001111
251  114   25   -   -         1026        9         912       000000101
252  92    25   -   -         828         9         736       010000011
253  132   25   -   -         1056        8         1056      10111100
254  101   25   -   -         909         9         808       001010111
255  147   25   -   -         1176        8         1176      10000101
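The columns of Table 23 appear to be related by simple products: COMPRESSED equals FRQ times DIC_SIZE (the Huffman code length in bits), and ORIGINAL equals FRQ times the 8-bit symbol width. The check below runs that reading against a few rows copied from the table; the tuple layout and the function name `row_is_consistent` are assumptions made for the example, and the column interpretation is inferred from the values rather than stated in the text.

```python
# A few rows from Table 23: (CV, FRQ, GRP, DIC_SIZE, COMPRESSED, ORIGINAL, HUFF)
ROWS = [
    (0, 1479, 1, 4, 5916, 11832, "1110"),
    (1, 96, 1, 8, 768, 768, "11000000"),
    (2, 79, 1, 9, 711, 632, "001001000"),
    (255, 150, 1, 8, 1200, 1200, "00110100"),
    (0, 1780, 25, 5, 8900, 14240, "00001"),
    (2, 295, 25, 7, 2065, 2360, "1000001"),
]


def row_is_consistent(row, symbol_bits=8):
    """Check one row against the assumed column relationships."""
    _cv, frq, _grp, dic_size, compressed, original, huff = row
    return (
        len(huff) == dic_size                 # DIC_SIZE is the code length
        and compressed == frq * dic_size      # bits after Huffman coding
        and original == frq * symbol_bits     # bits before coding
    )


all_consistent = all(row_is_consistent(row) for row in ROWS)
```

Under this reading, a symbol saves space only when its code is shorter than 8 bits: symbol 0 of group 1 shrinks from 11,832 to 5,916 bits, while symbols with 9-bit codes grow slightly.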

As described above, the method and apparatus for compressing and restoring binary data according to this embodiment can compress and restore binary data quickly and efficiently through simple operations and a simple hardware configuration. They achieve a high compression ratio, improve the reliability of the compressed and restored data, and improve transmission efficiency and speed during data transmission.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

100: binary data compression apparatus
110: compression unit 111: cyclic group acquisition unit
112: compressed data generation unit
120: output unit
200: binary data restoration apparatus
210: input unit 220: restoration unit
221: cyclic group restoration unit 222: combining unit

Claims

1. A binary data compression method of a compression apparatus, comprising:
scanning, by a compression unit, original binary data in n-bit units in a first direction to obtain a plurality of cyclic blocks each having a plurality of n-bit symbols, wherein each cyclic block is a binary block having the symbols encountered until all 2^n different kinds of symbols have appeared at least once when the original binary data is scanned sequentially in the first direction;
obtaining a plurality of cyclic groups each including at least one cyclic block, based on the dispersion of the frequencies with which the symbols appear within one cyclic block or a plurality of neighboring cyclic blocks;
generating, by the compression unit, a plurality of compressed cyclic groups by compressing each cyclic group through Huffman coding of the symbols in that cyclic group; and
generating, by the compression unit, compressed data by combining the plurality of compressed cyclic groups.

2. The method of claim 1,
wherein obtaining the cyclic groups comprises:
calculating a first dispersion of the frequencies with which the symbols appear in a current cyclic block;
comparing the first dispersion with a reference dispersion; and
obtaining the current cyclic block as the cyclic group if the first dispersion is equal to or greater than the reference dispersion.

3. The method of claim 2, further comprising:
if the first dispersion is less than the reference dispersion, calculating a second dispersion of the frequencies with which the symbols appear in the current cyclic block and at least one next cyclic block, and obtaining the current cyclic block and the at least one next cyclic block as the cyclic group,
wherein the number of the at least one next cyclic block is the minimum number for which the second dispersion is equal to or greater than the reference dispersion.

4. The method of claim 1,
wherein the dispersion is the variance, standard deviation, skewness, or kurtosis of the frequencies, or an index indicating whether at least some kinds of symbols account for at least a reference ratio of the total frequency.

5. The method of claim 1,
further comprising obtaining, by the compression unit, a compressed residual group by performing Huffman coding on the binary data of a residual group remaining after the plurality of cyclic groups have been obtained, based on the frequencies with which the symbols appear in the residual group.

6. The method of claim 1,
wherein generating the plurality of compressed cyclic groups comprises:
generating a Huffman code for each symbol according to the frequency of that symbol in each cyclic group; and
replacing each symbol in each cyclic group with the corresponding Huffman code.

7. The method of claim 6,
wherein generating the plurality of compressed cyclic groups further comprises:
generating a coding dictionary that includes a Huffman code string, formed by concatenating the Huffman codes corresponding to the symbols in each cyclic group, and the bit length of each Huffman code in the Huffman code string.

8. The method of claim 6,
wherein generating the plurality of compressed cyclic groups further comprises:
generating a coding dictionary that includes information on the occurrence frequency of each symbol in each cyclic group.

9. A method of restoring binary data compressed by the binary data compression method of claim 7 or 8,
wherein a restoration unit restores the binary data by restoring the plurality of compressed cyclic groups into a plurality of cyclic groups with reference to the coding dictionary for each cyclic group.

10. A binary data compression apparatus comprising:
a compression unit configured to scan original binary data in n-bit units in a first direction to obtain a plurality of cyclic blocks each having a plurality of n-bit symbols,
to obtain a plurality of cyclic groups each including at least one cyclic block, based on the dispersion of the frequencies with which the symbols appear within one cyclic block or a plurality of neighboring cyclic blocks,
to generate a plurality of compressed cyclic groups by compressing each cyclic group through Huffman coding of the symbols in that cyclic group, and
to generate compressed data by combining the plurality of compressed cyclic groups,
wherein each cyclic block is a binary block having the symbols encountered until all 2^n different kinds of symbols have appeared at least once when the original binary data is scanned sequentially in the first direction.

11. The apparatus of claim 10,
wherein the compression unit calculates a first dispersion of the frequencies with which the symbols appear in a current cyclic block, compares the first dispersion with a reference dispersion, and, if the first dispersion is equal to or greater than the reference dispersion, obtains the current cyclic block as the cyclic group.

12. The apparatus of claim 11,
wherein, if the first dispersion is less than the reference dispersion, the compression unit calculates a second dispersion of the frequencies with which the symbols appear in the current cyclic block and at least one next cyclic block, and, if the second dispersion is equal to or greater than the reference dispersion, obtains the current cyclic block and the at least one next cyclic block as the cyclic group,
wherein the number of the at least one next cyclic block is the minimum number for which the second dispersion is equal to or greater than the reference dispersion.

13. The apparatus of claim 10,
wherein the dispersion is the variance, standard deviation, skewness, or kurtosis of the frequencies, or an index indicating whether at least some kinds of symbols account for at least a reference ratio of the total frequency.

14. The apparatus of claim 10,
wherein the compression unit further obtains a compressed residual group by performing Huffman coding on the binary data of a residual group remaining after the plurality of cyclic groups have been obtained, based on the frequencies with which the symbols appear in the residual group.

15. The apparatus of claim 10,
wherein, when generating the plurality of compressed cyclic groups, the compression unit generates a Huffman code for each symbol according to the frequency of that symbol in each cyclic group, and replaces each symbol in each cyclic group with the corresponding Huffman code.

16. The apparatus of claim 15,
wherein the compression unit further generates a coding dictionary that includes a Huffman code string, formed by concatenating the Huffman codes corresponding to the symbols, and the bit length of each Huffman code in the Huffman code string.

17. The apparatus of claim 15,
wherein the compression unit further generates a coding dictionary that includes information on the occurrence frequency of each symbol in each cyclic group.

18. An apparatus for restoring binary data compressed by the binary data compression apparatus of claim 16 or 17,
wherein the apparatus restores the binary data by restoring the plurality of compressed cyclic groups into a plurality of cyclic groups with reference to the coding dictionary for each cyclic group.