KR20200047003A

KR20200047003A - Method for compressing and transfering data and apparatus thereof

Info

Publication number: KR20200047003A
Application number: KR1020180128931A
Authority: KR
Inventors: 김성일
Original assignee: 삼성에스디에스 주식회사
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2020-05-07
Also published as: KR102480450B1

Abstract

Provided is a method for compressing and transmitting record-type data having a plurality of fields. The data compression and transmission method, performed by a computing device, comprises: receiving original data composed of a set of records having a plurality of fields; obtaining compressed data for the original data by removing duplicate field values for each field of the original data; and transmitting the obtained compressed data. The transmitting step may include grouping field values located adjacent to each other in the original data to form transmission groups, and transmitting the compressed data divided into units of the transmission groups.

Description

Data compression and transmission method and apparatus thereof (METHOD FOR COMPRESSING AND TRANSFERING DATA AND APPARATUS THEREOF)

The present invention relates to a data compression and transmission method and an apparatus therefor. More particularly, it relates to a method for losslessly compressing original data composed of a set of records having a plurality of fields, and an apparatus for performing the method.

Because storing large volumes of data requires a large amount of storage space, compression technology is being actively researched as a way to manage storage space more efficiently.

Compression methods that reduce the storage space of data can be broadly divided into lossy and lossless techniques. Lossy compression loses some data when the data is compressed and restored, and is typically used for media data such as photographs, pictures, and video. Lossless compression loses no data when the data is compressed and restored, and is mainly used for text data.

Lossless compression techniques mainly applied to large volumes of text data include run-length compression and dictionary-based compression. Run-length compression reduces data by counting the number of repeated characters in a consecutive character string and replacing the repeated string with the count value. However, because run-length compression can only compress on a row basis, applying it to record-type data such as log data, in which duplication occurs frequently in units of columns or fields, is very inefficient.
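As a rough illustration of the run-length technique described above, the following Python sketch encodes a string as (character, count) pairs. The function names are hypothetical and the pair-based output format is an assumption chosen for readability; neither is taken from this document.

```python
from itertools import groupby


def run_length_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def run_length_decode(pairs):
    """Rebuild the original string from (char, count) pairs."""
    return "".join(char * count for char, count in pairs)


# A long run inside one row collapses into a few pairs.
encoded = run_length_encode("aaaabbbcc")
assert run_length_decode(encoded) == "aaaabbbcc"

# A value repeated down a column (once per record) does not form a run
# when the records are scanned row by row, so nothing is saved.
rows = "INFO,GET,/a\nINFO,GET,/b\n"
print(len(run_length_encode(rows)), "pairs for", len(rows), "characters")
```

The second example hints at why a row-oriented scan gains little on record-type data: the repeated values "INFO" and "GET" are separated by other fields, so they never appear as consecutive characters.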

Dictionary-based compression compresses text data using a dictionary composed of frequently occurring duplicate patterns and the compressed values matched to them. However, dictionary-based compression improves compression performance only when the duplicate patterns are known in advance, and cannot be applied when no dictionary has been prepared beforehand. Moreover, it cannot be used effectively on data whose duplicate patterns change.

Accordingly, there is a need for a method that can efficiently compress large volumes of text data in which duplicate patterns appear in units of fields, or in which the duplicate patterns change.

Korean Laid-Open Patent Publication No. 1-2015-0031752 (published March 25, 2015)

The technical problem to be solved by the present invention is to provide a method capable of efficiently compressing record-type data having a plurality of fields, and an apparatus for performing the method.

Another technical problem to be solved by the present invention is to provide a method capable of improving compression performance for record-type data in which duplicate patterns occur frequently in units of fields, and an apparatus for performing the method.

Still another technical problem to be solved by the present invention is to provide a method capable of improving compression performance for record-type data whose duplicate patterns change, and an apparatus for performing the method.

Yet another technical problem to be solved by the present invention is to provide a method of efficiently transmitting record-type data having a plurality of fields so as to reduce network cost, and an apparatus for performing the method.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the above technical problems, a data compression and transmission method according to an embodiment of the present invention is performed by a computing device and may include: receiving original data composed of a set of records having a plurality of fields; obtaining compressed data for the original data by removing duplicate field values for each field of the original data; and transmitting the obtained compressed data. Here, the transmitting may include forming transmission groups by grouping field values located adjacent to each other in the original data, and transmitting the compressed data divided into units of the transmission groups.

In one embodiment, the method may further include determining whether to perform compression based on a compression suitability of the input original data, and the obtaining may be performed in response to a determination to perform compression.

In one embodiment, obtaining the compressed data may include: calculating, for each field of the original data, the number of unique (non-duplicated) field values; selecting, from among the plurality of fields, a field whose number of unique field values is less than a threshold as a compression target field; and removing duplicate field values from the compression target field.
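The field-selection step above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: records are equal-length tuples, the function names are hypothetical, and the threshold of 3 is an arbitrary illustrative value rather than one given in this document.

```python
def count_unique_values(records):
    """Return the number of distinct values seen at each field position."""
    num_fields = len(records[0])
    return [len({record[i] for record in records}) for i in range(num_fields)]


def select_target_fields(records, threshold):
    """Return indices of fields whose unique-value count is below threshold."""
    counts = count_unique_values(records)
    return [i for i, count in enumerate(counts) if count < threshold]


# Hypothetical log records: timestamp, level, host.
records = [
    ("2018-10-26 10:00:01", "INFO", "server-1"),
    ("2018-10-26 10:00:02", "INFO", "server-1"),
    ("2018-10-26 10:00:03", "WARN", "server-1"),
    ("2018-10-26 10:00:04", "INFO", "server-1"),
]
targets = select_target_fields(records, threshold=3)
print(targets)
```

In this example the timestamp field has four unique values and is left out, while the level and host fields (two and one unique values) are selected as compression targets.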

In one embodiment, obtaining the compressed data may include: extracting a plurality of duplicate patterns for a first field of the original data by applying a sliding window to the first field; calculating a first deduplication rate by performing first deduplication based on a first duplicate pattern among the plurality of duplicate patterns; calculating a second deduplication rate by performing second deduplication based on a second duplicate pattern among the plurality of duplicate patterns; and, in response to determining that the first deduplication rate is higher than the second deduplication rate, determining the first deduplicated data as the compressed data of the first field.
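One possible reading of the pattern-comparison step above is sketched below in Python. The details are assumptions made for illustration only: a "pattern" is taken to be a fixed-width tuple of consecutive values in one field, and the "deduplication rate" is taken to be the fraction of values that could be dropped if every non-overlapping repeat of the pattern after its first occurrence were removed. The function names are hypothetical, and the document's own procedure may differ.

```python
from collections import Counter


def extract_duplicate_patterns(values, window):
    """Slide a fixed-size window over a field; keep windows seen more than once."""
    counts = Counter(
        tuple(values[i:i + window]) for i in range(len(values) - window + 1)
    )
    return [pattern for pattern, count in counts.items() if count > 1]


def deduplication_rate(values, pattern):
    """Fraction of values removable when repeats of pattern after the first are dropped."""
    if not values:
        return 0.0
    width, i, removed, seen = len(pattern), 0, 0, False
    while i + width <= len(values):
        if tuple(values[i:i + width]) == pattern:
            if seen:
                removed += width  # every repeat after the first is removable
            seen = True
            i += width            # count non-overlapping occurrences only
        else:
            i += 1
    return removed / len(values)


def best_pattern(values, window):
    """Pick the candidate pattern with the highest deduplication rate."""
    patterns = extract_duplicate_patterns(values, window)
    return max(patterns, key=lambda p: deduplication_rate(values, p), default=None)


field = ["A", "B", "A", "B", "A", "B", "C"]
print(best_pattern(field, window=2))
```

For this field, the candidates are ("A", "B") and ("B", "A"); the first removes four of seven values and the second only two, so the first is chosen.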

In one embodiment, the plurality of fields may include a first field to which compression is applied and a second field to which compression is not applied, and the transmitting of the obtained compressed data may include transmitting the data of the first field and the data of the second field separately.

In one embodiment, the transmitting in units of the transmission groups may include transmitting, together with a first transmission group, location information indicating the location of the first transmission group in the original data.
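The idea of sending each transmission group together with its location can be sketched as follows. This is an illustrative assumption rather than the document's format: each value retained after deduplication is tagged with a flat index into the original data, values with consecutive indices are merged into one group, and the group's starting index serves as its location information. The function name and the sample values are hypothetical.

```python
def build_transmission_groups(retained):
    """Group retained values whose original positions are consecutive.

    retained: iterable of (position, value) pairs, where position is the
    value's index in the original data (hypothetical flat indexing).
    Returns a list of (start_position, [values...]) groups.
    """
    groups = []
    for position, value in sorted(retained, key=lambda pair: pair[0]):
        if groups and groups[-1][0] + len(groups[-1][1]) == position:
            groups[-1][1].append(value)          # adjacent: extend current group
        else:
            groups.append((position, [value]))   # gap: start a new group
    return groups


# Values left after deduplication, tagged with their original positions.
retained = [(0, "t1"), (1, "INFO"), (2, "srv"), (3, "t2"), (6, "t3"), (7, "WARN")]
for start, values in build_transmission_groups(retained):
    print(start, values)
```

Because each group carries only one starting position, a receiver can place every value of the group back at its original location without a per-value header.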

In one embodiment, a first transmission group among the formed transmission groups may include a plurality of field values belonging to the same field, and the transmitting in units of the transmission groups may include: determining whether the plurality of field values match a single duplicate pattern; in response to determining that they do not match, splitting the first transmission group into sub-transmission groups based on the duplicate patterns; and transmitting the first transmission group divided into units of the sub-transmission groups.

In one embodiment, the original data may include field delimiters for separating the plurality of fields, and the transmitting in units of the transmission groups may include transmitting the data while omitting a first field delimiter located between transmission groups and a second field delimiter located between deduplicated field values.

A data compression device according to another embodiment of the present invention for solving the above technical problems may include: a communication interface; a memory containing one or more instructions; and a processor that, by executing the one or more instructions, receives original data composed of a set of records having a plurality of fields, obtains compressed data for the original data by removing duplicate field values for each field of the original data, and transmits the obtained compressed data through the communication interface. Here, the processor may form transmission groups by grouping field values located adjacent to each other in the original data, and may transmit the compressed data divided into units of the transmission groups.

A data compression computer program according to still another embodiment of the present invention for solving the above technical problems may be stored in a computer-readable recording medium in order to execute, in combination with a computing device: receiving original data composed of a set of records having a plurality of fields; obtaining compressed data for the original data by removing duplicate field values for each field of the original data; and transmitting the obtained compressed data. Here, the transmitting may include forming transmission groups by grouping field values located adjacent to each other in the original data, and transmitting the compressed data divided into units of the transmission groups.

FIG. 1 is an exemplary configuration diagram showing a data analysis system according to an embodiment of the present invention.
FIGS. 2 and 3 are exemplary block diagrams showing a data compression and transmission device according to an embodiment of the present invention.
FIG. 4 is an exemplary flowchart showing a method for determining whether to perform compression according to an embodiment of the present invention.
FIGS. 5 and 6 are exemplary views showing data suitable for compression among original data that may be referred to in some embodiments of the present invention.
FIG. 7 is an exemplary view for explaining a compression suitability inspection method according to an embodiment of the present invention.
FIG. 8 is an exemplary view showing data not suitable for compression among original data that may be referred to in some embodiments of the present invention.
FIG. 9 is an exemplary flowchart showing a data compression and transmission method according to an embodiment of the present invention.
FIGS. 10 to 13 are exemplary views for explaining a duplicate-pattern-based deduplication method according to an embodiment of the present invention.
FIG. 14 is an exemplary view showing compressed data that may be obtained by performing deduplication on the original data shown in FIGS. 5 and 6.
FIG. 15 is an exemplary flowchart showing a data transmission method according to an embodiment of the present invention.
FIG. 16 is an exemplary view for explaining a method of forming transmission groups according to an embodiment of the present invention.
FIG. 17 is an exemplary view for explaining a method of restoring original data at a receiving side according to an embodiment of the present invention.
FIGS. 18 to 20 are exemplary views for explaining cases in which a transmission group is exceptionally split according to an embodiment of the present invention.
FIG. 21 is a hardware configuration diagram showing an exemplary computing device capable of implementing devices according to various embodiments of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the present invention pertains, and the present invention is defined only by the scope of the claims.

In adding reference numerals to the components of each drawing, it should be noted that the same components are given the same reference numerals wherever possible, even when they appear in different drawings. In addition, in describing the present invention, detailed descriptions of related well-known configurations or functions are omitted when they are judged likely to obscure the gist of the present invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) are used with the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless expressly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the text specifically states otherwise.

In describing the components of the present invention, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another, and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, the component may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

As used in this specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Before the description proceeds, some terms used in this specification are clarified.

In this specification, a record, or record data, is a unit of data having one or more fields. A field is the smallest logical unit of data (that is, the smallest unit of data that carries meaning), and in the art the term is used interchangeably with attribute, column, item, and the like. Record data may include field delimiters (or separators) to distinguish the fields. However, the technical scope of the present invention is not limited thereto.
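To make the terms concrete, the following Python sketch splits delimited text into records and fields and extracts one field as a column. It assumes one record per line and a comma as the field delimiter; both choices, the sample log lines, and the function names are illustrative assumptions rather than requirements stated in this document.

```python
def parse_records(text, delimiter=","):
    """Split text into records (one per line) and each record into fields."""
    return [line.split(delimiter) for line in text.splitlines() if line]


def field_values(records, index):
    """Return the column of values stored at one field position."""
    return [record[index] for record in records]


log = "10:00:01,INFO,login\n10:00:02,INFO,logout\n"
records = parse_records(log)
print(field_values(records, 1))
```

Viewing a field as a column in this way is what makes per-field duplication visible: the second field holds the same value in both records.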

In this specification, an instruction is a series of commands grouped by function, which is a component of a computer program and is executed by a processor.

Hereinafter, some embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary configuration diagram showing a data analysis system according to an embodiment of the present invention.

As shown in FIG. 1, the data analysis system may include at least one data source (1-1 to 1-n), a data compression and transmission device 100, and an analysis device 3. However, this is only a preferred embodiment for achieving the object of the present invention, and some components may of course be added or removed as necessary. It should also be noted that the components of the data analysis system shown in FIG. 1 represent functionally distinct elements, and that a plurality of components may be implemented in integrated form in an actual physical environment.

Conversely, in an actual physical environment each of the components may be implemented in a form divided into a plurality of detailed functional elements. For example, a first function of the data compression and transmission device 100 may be implemented in a first computing device, and a second function may be implemented in a second computing device. Each of the components is described below. For convenience of description, the data compression and transmission device 100 is hereinafter abbreviated as the compression device 100.

In the data analysis system, the at least one data source (1-1 to 1-n) is any device or repository that provides data to be analyzed. The data to be analyzed may be composed of a set of records having a plurality of fields. For example, the data to be analyzed may include log data, but the technical scope of the present invention is not limited to this example. For examples of the record data, refer to FIG. 5 and related drawings.

In the data analysis system, the compression device 100 is a computing device equipped with data compression and transmission functions. The computing device may be a notebook computer, a desktop, a laptop, or the like, but is not limited thereto and may include any type of device equipped with a computing function. In an environment that handles large volumes of data, however, it may be preferable to implement the compression device 100 as a high-performance server-class computing device. For an example of the computing device, refer to FIG. 21.

The compression device 100 obtains compressed data by compressing large volumes of original data collected from the data sources (1-1 to 1-n), and transmits the compressed data to the analysis device 3. In this way, the storage space of the analysis device 3 can be used efficiently, and the network cost of transmission can be reduced.

According to an embodiment of the present invention, the compression device 100 may obtain the compressed data by detecting duplicate patterns for each field and removing duplicate field values from the original data based on the detected duplicate patterns. According to this embodiment, compression performance can be greatly improved for data in which duplicate patterns occur frequently in units of fields (e.g., log data). Furthermore, because compression is performed based on duplicate patterns detected in real time, the compression function according to this embodiment can also be used effectively on data whose duplicate patterns change.
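A deliberately simplified sketch of per-field duplicate removal with lossless restoration is given below in Python. It rests on an assumption that is not stated in this document: a field value counts as a duplicate when it equals the same field's value in the immediately preceding record, and a removed value is marked with `None`. The function names are hypothetical, and the pattern-based procedure of the embodiments may be more elaborate.

```python
def remove_field_duplicates(records):
    """Replace a field value with None when it repeats the previous record's value."""
    compressed, previous = [], None
    for record in records:
        if previous is None:
            compressed.append(list(record))  # first record is kept in full
        else:
            compressed.append(
                [None if value == prev else value
                 for value, prev in zip(record, previous)]
            )
        previous = record
    return compressed


def restore_records(compressed):
    """Rebuild the original records by carrying forward the last seen value."""
    restored, previous = [], None
    for row in compressed:
        if previous is None:
            record = list(row)
        else:
            record = [prev if value is None else value
                      for value, prev in zip(row, previous)]
        restored.append(record)
        previous = record
    return restored


records = [
    ["t1", "INFO", "srv"],
    ["t2", "INFO", "srv"],
    ["t3", "WARN", "srv"],
]
compressed = remove_field_duplicates(records)
assert restore_records(compressed) == records  # lossless round trip
```

Because the comparison always uses the values actually observed in the data, no dictionary has to be prepared in advance, which matches the motivation given above for handling duplicate patterns that change over time.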

In addition, according to an embodiment of the present invention, the compression device 100 may form transmission groups by grouping a plurality of field values included in the compressed data, and may transmit the compressed data divided into units of the transmission groups. In this way, the data transmission cost attributable to field delimiters and transmission headers can be greatly reduced.

For reference, to speed up the compression process, the compression device 100 may perform parallel processing using a plurality of processors, and distributed processing may also be performed across a plurality of compression devices 100.

How the compression device 100 compresses the original data and transmits the compressed data is described in detail later with reference to FIG. 3 and the subsequent drawings.

In the data analysis system, the analysis device 3 is a device that analyzes large volumes of data. The analysis device 3 may receive compressed data from the compression device 100 and store the compressed data in its storage space, so that the storage space of the analysis device 3 is used effectively. Alternatively, the analysis device 3 may restore the compressed data to reconstruct the original data and then analyze the original data. Because the compression function according to an embodiment of the present invention is a lossless compression technique, the analysis device 3 can completely restore the original data without loss. Accordingly, large volumes of data can be analyzed accurately.

At least some of the components shown in FIG. 1 may communicate over a network. The network may be implemented as any type of wired or wireless network, such as a local area network (LAN), a wide area network (WAN), a mobile radio communication network, or WiBro (Wireless Broadband Internet).

The data analysis system according to an embodiment of the present invention has been described above with reference to FIG. 1. As described, the compression device 100 according to an embodiment of the present invention can be applied to an environment that handles large volumes of data to increase the efficiency of storage space. In addition, the compression function according to an embodiment of the present invention can be applied to a general data transmission environment (that is, on the transmitting device side) to help reduce network cost.

Hereinafter, the configuration and operation of the compression device 100 according to an embodiment of the present invention are described with reference to FIGS. 2 and 3.

FIGS. 2 and 3 are exemplary block diagrams showing the compression device 100 according to an embodiment of the present invention.

As shown in FIG. 2, the compression device 100 may include an input unit 110, an inspection unit 130, a compression unit 150, and a transmission unit 170. However, FIG. 2 shows only the components related to the embodiment of the present invention. Accordingly, a person of ordinary skill in the art to which the present invention pertains will understand that other general-purpose components may be included in addition to those shown in FIG. 2. It should also be noted that the components of the compression device 100 shown in FIG. 2 represent functionally distinct elements, and that a plurality of components may be implemented in integrated form in an actual physical environment. Each component is described below.

입력부(110)는 원본 데이터(11)를 입력받는다. 전술한 바와 같이, 원본 데이터(11)는 복수의 필드를 갖는 레코드의 집합이다.The input unit 110 receives the original data 11. As described above, the original data 11 is a set of records having a plurality of fields.

Next, the inspection unit 130 checks the compression suitability of the original data 11. Based on the check result, the inspection unit 130 may also determine whether to perform compression on the original data 11. In response to a decision to perform compression, the original data 11 may be input to the compression unit 150. Conversely, in response to a decision not to perform compression, the original data 11 may be input directly to the transmission unit 170.

The inspection unit 130 may also select compression target fields from among the plurality of fields of the original data 11. The compression target fields may be selected based on the result of the compression suitability check, as described later with reference to FIG. 7.

To avoid redundant description, the operation of the inspection unit 130 is described in more detail later with reference to FIGS. 4 to 8.

Next, the compression unit 150 compresses the original data 11 field by field. When compression target fields are designated, the compression unit 150 may compress only those fields. As a result of the compression, compressed data for the original data 11 is obtained. As shown in FIG. 3, the compression unit 150 may include a duplicate pattern detection unit 151 and a duplicate removal unit 153.

The duplicate pattern detection unit 151 detects duplicate patterns for each field of the original data 11.

Next, the duplicate removal unit 153 uses the detected duplicate patterns to remove, field by field, the duplicate field values contained in the original data 11.

To avoid redundant description, the operation of the compression unit 150 is described in more detail later with reference to FIGS. 9 to 14.

Next, the transmission unit 170 transmits data (e.g., original data or compressed data). As shown in FIG. 3, the transmission unit 170 may include a transmission group configuration unit 171 and a data transmission unit 173.

The transmission group configuration unit 171 forms a transmission group by grouping at least one field value contained in the compressed data.

Next, the data transmission unit 173 generates and attaches a transmission header for each transmission group and transmits the compressed data 13 in units of transmission groups. The transmission header contains the location information of the corresponding transmission group. A receiving device (not shown) uses this location information to restore the original data from the compressed data it receives.

To avoid redundant description, the operation of the transmission unit 170 is described in more detail later with reference to FIGS. 15 to 20.

For reference, although not shown in FIGS. 2 and 3, the compression device 100 may further include a division unit (not shown). The division unit may divide the input original data 11 into partial data each having a predetermined number of records. In this case, the inspection unit 130, the compression unit 150, and the transmission unit 170 may perform the operations described above on each piece of partial data.

Note that not every component shown in FIG. 2 or FIG. 3 is necessarily essential for implementing the compression device 100. For example, a compression device 100 according to another embodiment of the present invention may be implemented using only some of the components shown in FIG. 2 or FIG. 3.

Each component of the compression device 100 shown in FIGS. 2 and 3 may be software, or hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application-Specific Integrated Circuit). However, the components are not limited to software or hardware; they may be configured to reside in an addressable storage medium or to run on one or more processors. The functions provided by the components may be implemented by further subdivided components, or a plurality of components may be combined into a single component that performs a specific function.

So far, the configuration and operation of the compression device 100 according to an embodiment of the present invention have been described with reference to FIGS. 2 and 3. Hereinafter, methods according to various embodiments of the present invention will be described with reference to FIGS. 4 to 20.

Each step of the methods according to the embodiments described below may be performed by a computing device. In other words, each step may be implemented as one or more instructions executed by a computing device. All steps of a method may be executed by one physical computing device, or first steps of the method may be performed by a first computing device and second steps by a second computing device. For ease of understanding, however, the description assumes that all steps are performed by the compression device 100.

FIG. 4 is an exemplary flowchart illustrating a method for determining whether to perform compression according to an embodiment of the present invention. This is merely a preferred embodiment for achieving the object of the present invention, and some steps may of course be added or omitted as necessary.

As shown in FIG. 4, the compression device 100 may perform a compression suitability check on the input original data and perform the compression process only when the data is determined to be suitable for compression (S10 to S70). This prevents computing resources from being wasted unnecessarily.

The specific way in which steps S30 and S50 determine whether to perform compression through the compression suitability check of the original data may vary by embodiment.

In a first embodiment, the compression device 100 may determine whether to perform compression based on the number of fields holding categorical data. This is because, in general, the compression effect improves as the data contains more categorical data rather than continuous data. Specifically, the compression device 100 may count the number of fields holding categorical data among the plurality of fields and decide to perform compression in response to determining that the counted number of fields is greater than or equal to a threshold. Otherwise, the compression device 100 may decide not to perform compression.
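The first embodiment can be illustrated with a minimal Python sketch. The description does not say how a field is recognized as categorical, so the `is_categorical` heuristic below (a field is categorical when at least one value cannot be parsed as a number) is purely an assumption, as are the function names and the sample records.

```python
def is_categorical(values):
    """Hypothetical heuristic (not from the description): a field is treated
    as categorical if at least one of its values is not numeric."""
    for value in values:
        try:
            float(value)
        except ValueError:
            return True
    return False


def decide_by_categorical_fields(records, threshold):
    """Return True (perform compression) when the number of categorical
    fields is greater than or equal to the threshold."""
    columns = list(zip(*records))  # transpose records into per-field columns
    categorical_count = sum(1 for column in columns if is_categorical(column))
    return categorical_count >= threshold


records = [
    ("Product1", "1200", "Visa"),
    ("Product2", "3600", "Mastercard"),
]
print(decide_by_categorical_fields(records, threshold=2))
```

In this sample, two of the three fields are categorical under the assumed heuristic, so a threshold of 2 leads to a decision to compress.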

In a second embodiment, the compression device 100 may determine whether to perform compression based on the number of unique field values calculated for each field. Specifically, the compression device 100 may count the number of fields whose number of unique field values is less than a first threshold and decide to perform compression in response to determining that the counted number of fields is greater than or equal to a second threshold. For ease of understanding, the second embodiment is described further with reference to FIGS. 5 to 7.

FIGS. 5 and 6 show an example of original data. As shown in FIG. 5, the original data 21 may include a field separator (",") to distinguish different fields. Of course, the original data need not include a field separator as long as the fields can be distinguished. For ease of understanding, FIG. 6 omits the field separators and shows the original data 21 as table-form data 23.

As shown in FIG. 7, the unique field values and the number of unique field values can be calculated for each field by removing duplicates (e.g., by a set operation) in each of the fields 31 to 37 of the original data 23. For example, for the first field 31, the unique field values are "Product1" and "Product2", and the number of unique field values is 2. The number of unique field values is calculated per field in order to judge in advance which fields will compress well. When the number of records is large but the number of unique field values is small, the data contains many duplicates, which in turn means that much of it can potentially be compressed.

Accordingly, the compression device 100 may decide to perform compression when the number of fields whose number of unique field values is less than the first threshold (i.e., fields that compress well) is greater than or equal to the second threshold (i.e., when there are many such fields). Otherwise, the compression device 100 may decide not to perform compression.

Here, the first threshold and the second threshold may be preset fixed values or values that vary with the situation. For example, the more available resources the compression device 100 has or the higher its computing performance, the larger the first threshold and the smaller the second threshold may be set.
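The two-threshold decision of the second embodiment can be sketched as follows. The function names, the threshold values, and the sample records are illustrative assumptions; only the rule itself (count the fields with fewer unique values than the first threshold, then compare that count with the second threshold) comes from the description.

```python
def unique_value_counts(records):
    """Number of unique field values per field, obtained with a set operation."""
    return [len(set(column)) for column in zip(*records)]


def decide_by_unique_values(records, first_threshold, second_threshold):
    """Return True (perform compression) when the number of fields with fewer
    than first_threshold unique values is at least second_threshold."""
    counts = unique_value_counts(records)
    effective_fields = sum(1 for count in counts if count < first_threshold)
    return effective_fields >= second_threshold


records = [
    ("Product1", "1200", "Mastercard"),
    ("Product1", "1300", "Visa"),
    ("Product2", "1400", "Mastercard"),
    ("Product2", "1500", "Visa"),
]
print(unique_value_counts(records))
print(decide_by_unique_values(records, first_threshold=3, second_threshold=2))
```

With these sample records the first and third fields each have two unique values, so both count as fields that compress well when the first threshold is 3.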

In some embodiments, compression target fields and non-target fields may be distinguished based on the number of unique field values. For example, the compression device 100 may select as compression target fields only those fields, among the plurality of fields, whose number of unique field values is less than a threshold, and may perform the compression process only on the compression target fields. Because compression is performed only on fields judged likely to compress well, computing cost is reduced and the compression gain relative to cost is improved.
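A minimal sketch of the target-field selection, assuming records are tuples and fields are identified by their zero-based index (both assumptions made only for illustration):

```python
def select_target_fields(records, threshold):
    """Return the indices of the fields whose number of unique field values
    is less than the threshold; only these fields would be compressed."""
    return [
        index
        for index, column in enumerate(zip(*records))
        if len(set(column)) < threshold
    ]


records = [
    ("Product1", "1200", "Mastercard"),
    ("Product1", "1300", "Visa"),
    ("Product2", "1400", "Mastercard"),
    ("Product2", "1500", "Visa"),
]
print(select_target_fields(records, threshold=3))
```

Here the second field has four unique values in four records, so it would be left uncompressed.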

In a third embodiment, the first and second embodiments described above may be performed on a subset of data sampled from the original data. Any sampling method may be used. The number of sampled records may be a preset fixed value or a value that varies with the situation. For example, the number of sampled records may be increased in order to improve the accuracy of the compression suitability check, or when sufficient computing resources are available. According to this embodiment, checking compression suitability on only part of the data reduces the computing cost of the check.
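Since the description leaves the sampling method open, the sketch below simply uses uniform random sampling without replacement from Python's standard library. The function name, the `seed` parameter, and the sample data are assumptions made for illustration.

```python
import random


def sample_records(records, sample_size, seed=None):
    """Draw sample_size records uniformly at random without replacement.
    If the data has no more records than requested, use all of it."""
    if sample_size >= len(records):
        return list(records)
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(records, sample_size)


records = [(f"Product{i % 2 + 1}", str(1000 + i)) for i in range(100)]
sample = sample_records(records, sample_size=10, seed=1)

# The suitability check then runs on the sample instead of the full data.
unique_counts = [len(set(column)) for column in zip(*sample)]
print(len(sample), unique_counts)
```

A larger `sample_size` trades computing cost for a more accurate estimate of the number of unique field values.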

In a fourth embodiment, the compression suitability check may be performed, and whether to compress the original data may be determined, based on various combinations of the embodiments described above.

According to the embodiments described above, for data whose compression effect is expected to be negligible (e.g., data with almost no duplication), such as the data 39 shown in FIG. 8, the compression process is skipped, so that computing resources are used more efficiently.

For reference, among steps S10 to S70 described above, step S10 may be performed by the input unit 110, steps S30 and S50 by the inspection unit 130, and step S70 by the compression unit 150.

In response to a decision to perform compression, the compression process for the original data may begin. The compression process is described below with reference to FIGS. 9 to 14.

FIG. 9 is an exemplary flowchart illustrating a data compression transmission method according to an embodiment of the present invention. This is merely a preferred embodiment for achieving the object of the present invention, and some steps may of course be added or omitted as necessary.

As shown in FIG. 9, the data compression method begins with step S110 of extracting a plurality of duplicate patterns for each field. In step S110, the compression device 100 may use a sliding window to extract the various patterns that can repeat in a given field, as described further with reference to the examples shown in FIGS. 10 and 11.

FIGS. 10 and 11 show an example of extracting duplicate patterns for a specific field from the data 40 of that field in the original data. Specifically, FIG. 10 shows an example of extracting first duplicate patterns 42 using a first sliding window 41 of size 1, and FIG. 11 shows an example of extracting second duplicate patterns 44 using a second sliding window 43 of size 2.

As shown in FIG. 10, the compression device 100 may move the first window 41 over the field data 40 and extract the field values corresponding to the first window 41 (e.g., Mastercard, Visa) as the first duplicate patterns 42. Likewise, as shown in FIG. 11, the compression device 100 may move the second window 43 over the same field data 40 and extract the field values corresponding to the second window 43 as the second duplicate patterns 44.

The second duplicate patterns 44 are a kind of sequence pattern, so "Mastercard-Visa" and "Visa-Mastercard" may be extracted as different patterns. However, the technical scope of the present invention is not limited thereto.

The compression device 100 may extract various duplicate patterns while increasing the window size up to a maximum size. The maximum size may be determined based on the total number of records in the target field (e.g., n/2 when the total number of records is n), but the technical scope of the present invention is not limited thereto.
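The sliding-window extraction of step S110 can be sketched as below. Patterns are represented as tuples so that order matters, which keeps "Mastercard-Visa" distinct from "Visa-Mastercard". The function name, the dictionary return shape, and the sample field data are assumptions; the default maximum window size of n/2 follows the example given above.

```python
def extract_patterns(field_values, max_window=None):
    """Slide windows of size 1..max_window over one field and collect the
    distinct value sequences seen, keyed by window size."""
    n = len(field_values)
    if max_window is None:
        max_window = n // 2  # example maximum size from the description
    patterns = {}
    for size in range(1, max_window + 1):
        windows = (
            tuple(field_values[start:start + size])
            for start in range(n - size + 1)
        )
        # dict.fromkeys removes repeats while preserving first-seen order
        patterns[size] = list(dict.fromkeys(windows))
    return patterns


field = ["Mastercard", "Mastercard", "Visa", "Visa"]
for size, found in extract_patterns(field).items():
    print(size, found)
```

Every window content is collected as a candidate here; whether a candidate actually removes duplicates is decided afterwards by the deduplication simulation.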

In step S130, the compression device 100 obtains compressed data for the original data by removing duplicate field values in each field based on the duplicate patterns. The compression device 100 calculates a deduplication rate by running deduplication simulations based on various duplicate patterns. The compression device 100 may then obtain the compressed data by removing the duplicate field values using the combination of duplicate patterns with the highest deduplication rate. Here, the deduplication rate is a compression performance metric based on the number of field values removed, and it may be calculated in any manner.

According to an embodiment of the present invention, the compression device 100 may perform deduplication (or a deduplication simulation) for each field using a plurality of duplicate patterns. The plurality of duplicate patterns may include patterns of which at least some differ in size, because doing so can further improve the deduplication rate. For ease of understanding, this embodiment is described further with reference to FIGS. 12 and 13.

FIGS. 12 and 13 each show an example of a deduplication simulation performed based on a plurality of duplicate patterns (hereinafter, a "duplicate pattern group"). Specifically, FIG. 12 shows a deduplication simulation based on a first duplicate pattern group 45 consisting of duplicate patterns of size 1, and FIG. 13 shows a deduplication simulation based on a second duplicate pattern group 47 consisting of duplicate patterns of size 2 or less. In FIGS. 12 and 13, the removed field values are shown with strikethrough.

As shown in FIG. 12, deduplication may be performed when field values matching a duplicate pattern appear consecutively. Specifically, based on the first duplicate pattern ("Mastercard") in the first duplicate pattern group 45, two of three consecutive "Mastercard" field values may be removed. Likewise, based on the second duplicate pattern ("Visa"), two of three consecutive "Visa" field values may be removed.
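For size-1 patterns, the removal described above amounts to keeping the first value of each consecutive run and dropping the rest. The sketch below marks removed positions with `None`; that marker, the function name, and the sample data are assumptions for illustration (the figures show removed values with strikethrough instead), and the sketch assumes `None` never occurs as a real field value.

```python
def remove_consecutive_duplicates(field_values):
    """Keep the first value of every run of identical consecutive values
    and replace the remaining values of the run with None."""
    result = []
    previous = object()  # sentinel that equals no real field value
    for value in field_values:
        if value == previous:
            result.append(None)  # duplicate of the value just above it
        else:
            result.append(value)
            previous = value
    return result


field = ["Mastercard", "Mastercard", "Mastercard", "Visa", "Visa", "Visa"]
deduplicated = remove_consecutive_duplicates(field)
print(deduplicated)
print(deduplicated.count(None), "of", len(field), "values removed")
```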

In addition, as shown in FIG. 13, based on the third duplicate pattern ("Mastercard-Visa") in the second duplicate pattern group 47, one of two consecutive "Mastercard-Visa" occurrences may be removed. Likewise, further duplicate field values may be removed based on the other duplicate patterns ("Visa", "Mastercard").

Comparing the deduplication simulation results 46 and 48 of FIGS. 12 and 13 shows that the deduplication rate is higher when the second duplicate pattern group 47 is applied. Therefore, of the two duplicate pattern groups 45 and 47, the second duplicate pattern group 47 may be used to deduplicate the field.

In the manner shown in FIGS. 12 and 13, the compression device 100 may compare the deduplication rates of various duplicate pattern groups and perform the final deduplication using the duplicate pattern group with the highest deduplication rate. The number of deduplication simulations may vary by embodiment. For example, the compression device 100 may repeat the simulation until the deduplication rate reaches or exceeds a threshold score. Alternatively, it may run the simulation a fixed number of times based on the computing performance or available resources of the compression device 100.
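One way to picture the simulation and comparison is the greedy sketch below. It is an assumed strategy, not necessarily the one used in the figures: at each position it tries the longest patterns first, keeps the first occurrence of a pattern that repeats back to back, and marks the following repetitions as removed. The deduplication rate is taken to be the fraction of removed field values, which is one of many possible definitions. All names and the sample data are hypothetical.

```python
def simulate(values, pattern_group):
    """Return a list of booleans marking which positions would be removed."""
    n = len(values)
    removed = [False] * n
    ordered = sorted(pattern_group, key=len, reverse=True)  # longest first
    i = 0
    while i < n:
        advanced = False
        for pattern in ordered:
            k = len(pattern)
            if k == 0:
                continue
            reps = 0  # back-to-back occurrences of pattern starting at i
            while tuple(values[i + reps * k:i + (reps + 1) * k]) == pattern:
                reps += 1
            if reps >= 2:
                for j in range(i + k, i + reps * k):
                    removed[j] = True  # keep the first occurrence only
                i += reps * k
                advanced = True
                break
        if not advanced:
            i += 1
    return removed


def deduplication_rate(values, pattern_group):
    """Assumed metric: fraction of field values that would be removed."""
    return sum(simulate(values, pattern_group)) / len(values) if values else 0.0


def best_pattern_group(values, pattern_groups):
    """Pick the pattern group with the highest simulated deduplication rate."""
    return max(pattern_groups, key=lambda group: deduplication_rate(values, group))


values = ["Mastercard", "Visa", "Mastercard", "Visa", "Mastercard", "Mastercard"]
group1 = [("Mastercard",), ("Visa",)]                        # sizes of 1
group2 = [("Mastercard",), ("Visa",), ("Mastercard", "Visa")]  # sizes up to 2
print(deduplication_rate(values, group1), deduplication_rate(values, group2))
```

In this sample the group containing the size-2 pattern removes more values, mirroring the observation that mixing pattern sizes can raise the deduplication rate.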

FIG. 14 shows the result 50 of deduplicating the original data 23 shown in FIG. 6. In particular, FIG. 14 shows, as an example, the third field of the original data 23 deduplicated using the duplicate pattern "Master-Visa-BC".

The description now returns to FIG. 9.

In step S150, the compression device 100 transmits the compressed data according to a preset rule. Step S150 is described in detail with reference to FIGS. 15 to 20.

For reference, among steps S110 to S150 described above, steps S110 and S130 may be performed by the duplicate pattern detection unit 151 and the duplicate removal unit 153, and step S150 may be performed by the transmission unit 170.

Meanwhile, when the original data consists of a very large number of records, the compression device 100 may divide the original data into partial data each having a predetermined number of records and perform steps S110 to S150 described above on each piece of partial data. The predetermined number may be a preset fixed value or a value that varies with the situation. For example, the predetermined number may vary based on the computing performance or available resources of the compression device 100.
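The division into partial data can be sketched as simple fixed-size chunking. The function name and the `chunk_size` parameter are assumptions; in practice the value could be chosen from the available resources as described above.

```python
def split_records(records, chunk_size):
    """Divide the records into consecutive pieces of partial data, each with
    at most chunk_size records (the last piece may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        records[start:start + chunk_size]
        for start in range(0, len(records), chunk_size)
    ]


records = [("Product1", "Visa")] * 5
parts = split_records(records, chunk_size=2)
print([len(part) for part in parts])
```

Steps S110 to S150 would then run independently on each element of `parts`.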

Hereinafter, a data transmission method according to an embodiment of the present invention will be described with reference to FIGS. 15 to 20.

FIG. 15 is an exemplary flowchart illustrating a data transmission method according to an embodiment of the present invention. This is merely a preferred embodiment for achieving the object of the present invention, and some steps may of course be added or omitted as necessary.

As shown in FIG. 15, the data transmission method begins with step S210 of organizing the compressed data into one or more transmission groups according to a preset rule. That is, the compression device 100 does not transmit the compressed data all at once but divides it into multiple transmission groups for transmission. In addition, the compression device 100 excludes unnecessary field separators and forms the transmission groups from only the essential field values. This reduces the network cost wasted on field separators. However, if the number of transmission groups becomes too large, the transmission headers may instead increase the network cost, so it is important to form the transmission groups appropriately. A method of forming transmission groups according to an embodiment of the present invention is described further below with reference to FIG. 16.

FIG. 16 shows an example in which transmission groups are formed for the compressed data 50.

As shown in FIG. 16, a plurality of transmission groups 51 to 55 may be formed for the compressed data 50 according to the preset rule.

According to the preset rule, the data of a field to which compression was not applied (e.g., a field that has no duplicate field values or is not a compression target) and the data of a field to which compression was applied may be placed in different transmission groups. For example, the data 51 of the uncompressed fourth field in the compressed data 50 may form a separate transmission group G1.

Also according to the preset rule, field values located adjacent to each other in the original data may form one transmission group. For example, the first field values 52 located adjacent to each other in the original data 23 may form a transmission group G2, and the second field values 53 may form a transmission group G3. The third field value 54 and the fourth field value 55 each have no adjacent field value, so each may form a transmission group (G4, G5) on its own. Hereinafter, the reference numerals 51 to 55 are used interchangeably to refer to a transmission group and to the field values belonging to it.
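The adjacency rule can be sketched in a simplified form. The sketch assumes the compressed data is a table in which removed values are marked `None`, and it only groups values that are vertically adjacent within a single field; groups in FIG. 16 may also span several fields, which this simplification does not attempt to reproduce. Grouping per field also keeps compressed and uncompressed fields in separate groups. All names and the sample table are hypothetical.

```python
def build_transmission_groups(table):
    """Group vertically adjacent retained values of each field.
    Each group records the (row, col) of its first value and its values."""
    groups = []
    if not table:
        return groups
    for col in range(len(table[0])):
        current = None  # group being extended, if any
        for row, record in enumerate(table):
            value = record[col]
            if value is None:      # removed value ends the current group
                current = None
                continue
            if current is None:    # first retained value starts a new group
                current = {"row": row, "col": col, "values": []}
                groups.append(current)
            current["values"].append(value)
    return groups


table = [
    ["P1", "Visa", "10"],
    [None, "BC",   "20"],
    [None, None,   "30"],
    ["P2", "Visa", "40"],
]
for group in build_transmission_groups(table):
    print(group)
```

A value with no retained neighbor, such as "P2" in the first field, ends up as a group of its own, which matches the rule that isolated field values form a transmission group alone.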

The description now returns to FIG. 15.

In step S230, the compression device 100 generates a transmission header for each transmission group. The transmission header contains the location information of the corresponding transmission group. The receiving device (not shown) uses this location information to restore the original data.

In FIG. 16, each transmission header is shown as a tag icon, and the coordinate information written in the tag icon indicates the location information of the corresponding transmission group. As shown in FIG. 16, the location information may be the coordinates at which the transmission group is located in the original data. For example, the location information of the first transmission group G1 may be the coordinates (0,0) of the first field value ("Product1"). However, the technical scope of the present invention is not limited thereto, and the location information of a transmission group may be expressed in many other ways.

In addition, as illustrated in FIG. 16, the coordinate information may be the coordinates of a reference field value included in the transmission group (e.g., the first field value in the case of transmission group 52), but the technical scope of the present invention is not limited thereto.

In addition, the transmission header may further include the size that the corresponding transmission group occupies on the original data (e.g., 3 x 3 in the case of transmission group 52).
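As a rough sketch of such a header, the following derives a position and a size from the cells of one transmission group. The `TransmissionHeader` type, the `make_header` function, the (column, row) coordinate order, and the choice of the top-left corner of the bounding box as the reference position are all assumptions made for illustration; the specification leaves the representation open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransmissionHeader:
    x: int       # column of the reference position (assumed: bounding-box left edge)
    y: int       # row of the reference position (assumed: bounding-box top edge)
    width: int   # number of fields (columns) the group spans
    height: int  # number of records (rows) the group spans


def make_header(cells):
    """Build a header from a non-empty list of (col, row) cell coordinates."""
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return TransmissionHeader(
        x=min(xs),
        y=min(ys),
        width=max(xs) - min(xs) + 1,
        height=max(ys) - min(ys) + 1,
    )
```

For a group whose cells span three fields and three records starting at the origin, this yields a 3 x 3 size, matching the example given for transmission group 52.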

Referring again to FIG. 15, in step S250, the compression device 100 transmits the compressed data divided into units of transmission groups. Specifically, the compression device 100 attaches a transmission header to each transmission group and transmits the compressed data (see FIG. 16). Here, a first field separator located between transmission groups (e.g., the separator between transmission groups 51 and 52), a second field separator located between deduplicated field values, and the like need not be transmitted to the receiving device (not shown). This is because the receiving device (not shown) can accurately restore the original data, including the field separators, even without the first and second field separators.

Since the transmission header includes the location information of the transmission group, the receiving device (not shown) can restore the original data. Hereinafter, a method of restoring the original data is described further with reference to FIG. 17.

As shown in FIG. 17, the receiving device (not shown) may reconstruct the compressed data 56 using the location information of the transmission groups 51 to 55. For example, the compressed data 56 may be reconstructed by repeating a process in which the first transmission group 51 is placed at position (3,0), the second transmission group 52 is placed at position (0,0), and so on.

In addition, as shown at the bottom of FIG. 17, the receiving device (not shown) may copy the field value ("Product1") into the blank portion 57 of the first field. That is, in the reverse of deduplication, the field value at the top that corresponds to the duplicate pattern ("Product1") is copied sequentially in the downward direction (i.e., the same direction in which deduplication proceeded), so that the field values of the original data are restored. By repeating this process for each field, the entire original data can be restored without data loss (see the underlined field values in FIG. 17). Of course, the field separators may also be restored if necessary.
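The two restoration stages (placing each transmission group at its position, then copying values downward into the blanks) can be sketched as follows. This is a simplified illustration and not the receiving device's actual procedure: it assumes each group arrives as a (column, row) position plus a rectangular block of values with `None` for cells outside the group, and it assumes every duplicate pattern is a single field value, so a blank is filled with the nearest value above it. The function name `restore` is hypothetical.

```python
def restore(groups, n_rows, n_cols):
    """Rebuild the record grid from transmission groups.

    `groups` is a list of ((x0, y0), block) pairs, where block is a list of
    rows and None marks a cell that does not belong to the group.
    Assumption: duplicate patterns are single values (forward fill).
    """
    grid = [[None] * n_cols for _ in range(n_rows)]

    # Stage 1: place every group at the position given in its header.
    for (x0, y0), block in groups:
        for dy, row in enumerate(block):
            for dx, value in enumerate(row):
                if value is not None:
                    grid[y0 + dy][x0 + dx] = value

    # Stage 2: per field, copy the value above into each blank cell.
    for c in range(n_cols):
        last = None
        for r in range(n_rows):
            if grid[r][c] is None:
                grid[r][c] = last
            else:
                last = grid[r][c]
    return grid
```

For example, a group at (0,0) holding `[["Product1", "A"], [None, "B"]]` and a group at (1,2) holding `[["C"]]` restore to three records that all carry "Product1" in the first field.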

Meanwhile, according to an embodiment of the present invention, a transmission group may be split in response to a specific condition being satisfied. Specifically, when the field values of the same field that belong to a transmission group do not match a single duplicate pattern, the transmission group may be split. This is because, if field values of the same field that do not match a single duplicate pattern are transmitted as one transmission group, an error may occur in the data restoration process on the receiving device (not shown) side. For ease of understanding, this is described further with reference to the examples shown in FIGS. 18 to 20.

FIG. 18 illustrates, as in FIG. 16, a plurality of field values 62 located adjacent to each other in the compressed data 60 being configured as one transmission group. Unlike FIG. 16, however, the blank field 63 of the third field in the compressed data 60 was produced by the duplicate pattern ("BC", 61). If the transmission group 62 is transmitted without being split, the receiving device (not shown) will restore it in the manner shown in FIG. 17, and the original data therefore cannot be restored accurately.

To solve this problem, the compression device 100 determines whether the field values 64 that belong to the third field and are located adjacent to each other match a single duplicate pattern (e.g., Mastercard-Visa-BC). In response to determining that they do not match the duplicate pattern, the compression device 100 splits the transmission group 62 into different sub transmission groups 65 and 66, as shown in FIG. 19. Here, since the duplicate pattern 61 used for deduplication is "BC", the compression device 100 may separate the field value ("BC") into a separate sub transmission group 66. If the duplicate pattern were "Visa-BC", the compression device 100 could separate the field values ("Visa", "BC") into the separate sub transmission group 66.
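The split decision for one field can be sketched as below. The sketch assumes that the duplicate pattern used for deduplication appears as the trailing values of the adjacent field values, as in the "BC" and "Visa-BC" examples; the function name `split_by_pattern` and the error handling are hypothetical and go beyond what the specification states.

```python
def split_by_pattern(values, pattern):
    """Split one field's adjacent values into sub transmission groups.

    `values` are the adjacent field values of a single field inside a
    transmission group; `pattern` is the duplicate pattern that produced
    the blank cells below them. Returns a list of value lists.
    Assumption: the pattern is a suffix of `values`.
    """
    values = list(values)
    pattern = list(pattern)
    if values == pattern:
        # The values match one duplicate pattern: no split is needed.
        return [values]
    n = len(pattern)
    if n == 0 or values[-n:] != pattern:
        raise ValueError("pattern is not a suffix of the field values")
    # The leading values form one sub group, the pattern values another.
    return [values[:-n], values[-n:]]
```

With the values Mastercard, Visa, BC and the pattern "BC", the result is one sub group holding Mastercard and Visa and another holding only BC; with the pattern "Visa-BC", the second sub group holds Visa and BC.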

FIG. 20 illustrates an example in which the receiving device (not shown) restores the original data after the transmission group has been split and transmitted.

As shown in FIG. 20, when the sub transmission group 66 is received separately, the receiving device (not shown) restores the blank field 68 based on the field value ("BC") of the sub transmission group 66, and the original data can therefore be restored accurately.

For reference, among steps S210 to S250 described above, step S210 may be performed by the transmission group configuration unit 171, and steps S230 and S250 may be performed by the data transmission unit 173.

The data transmission method according to an embodiment of the present invention has been described above with reference to FIGS. 15 to 20.

Methods according to various embodiments of the present invention have been described above with reference to FIGS. 4 to 20. According to these embodiments, various advantages may be provided for record-type data having a plurality of fields.

First, compression performance may be greatly improved for record-type data in which duplicate patterns occur frequently on a per-field basis. In particular, because some field separators are removed in addition to the duplicated field values, compression performance may improve markedly for big data.

In addition, conventionally known text compression techniques may be used together with the technical idea described above. For example, text compression may first be performed on the record-type data, after which the compression according to an embodiment of the present invention is performed. Alternatively, the compression may be performed in the reverse order. Accordingly, overall compression performance may be further improved.

In addition, unlike dictionary-based compression techniques, the technical idea described above does not use a separate dictionary, so it can be readily applied to record-type data in which the duplicate patterns change frequently. Of course, it is also possible to build a dictionary dynamically and have the transmitting side send the built dictionary to the receiving side together with the compressed data. According to the embodiments of the present invention, however, there is no need to transmit a dynamically built dictionary to the receiving side, so the network cost can be reduced considerably.

Hereinafter, an exemplary computing device capable of implementing a device (e.g., the compression device 100) according to various embodiments of the present invention is briefly described with reference to FIG. 21.

FIG. 21 is a hardware configuration diagram illustrating an exemplary computing device capable of implementing a device (e.g., the compression device 100) according to various embodiments of the present invention.

Referring to FIG. 21, the computing device 200 may include one or more processors 210, a bus 250, a communication interface 270, a memory 230 into which a computer program executed by the processor 210 is loaded, and a storage 290 that stores the computer program 291. However, FIG. 21 shows only the components related to the embodiments of the present invention. Accordingly, a person of ordinary skill in the art to which the present invention pertains will appreciate that other general-purpose components may be included in addition to those shown in FIG. 21.

The processor 210 controls the overall operation of each component of the computing device 200. The processor 210 may include a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphic Processing Unit), or any type of processor well known in the technical field of the present invention. In addition, the processor 210 may perform operations for at least one application or program for executing the methods according to various embodiments of the present invention. The computing device 200 may include one or more processors.

The memory 230 stores various data, commands, and/or information. The memory 230 may load one or more programs 291 from the storage 290 in order to execute the methods according to various embodiments of the present invention. The memory 230 may be implemented as a volatile memory such as RAM, but the technical scope of the present invention is not limited thereto.

When a computer program that performs the data compression and transmission method according to an embodiment of the present invention is loaded into the memory 230, the components illustrated in FIG. 2 may be implemented on the memory 230 in the form of logic.

The bus 250 provides a communication function between the components of the computing device 200. The bus 250 may be implemented as various types of buses, such as an address bus, a data bus, and a control bus.

The communication interface 270 supports wired and wireless Internet communication of the computing device 200. The communication interface 270 may also support various communication methods other than Internet communication. To this end, the communication interface 270 may include a communication module well known in the technical field of the present invention.

The storage 290 may store the one or more programs 291 non-transitorily. The storage 290 may include a non-volatile memory such as a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), or a flash memory, a hard disk, a removable disk, or any form of computer-readable recording medium well known in the art to which the present invention pertains.

The computer program 291 may include one or more instructions that, when loaded into the memory 230, cause the processor 210 to perform operations according to various embodiments of the present invention. That is, the processor 210 may perform the operations by executing the one or more instructions.

For example, the computer program 291 may include one or more instructions for performing an operation of receiving original data composed of a set of records having a plurality of fields, and obtaining compressed data for the original data by removing, for each field of the original data, the duplicate field values included in the original data. In such a case, the compression device 100 according to an embodiment of the present invention may be implemented through the computing device 200.

An exemplary computing device 200 capable of implementing a device according to various embodiments of the present invention has been described above with reference to FIG. 21.

Various embodiments of the present invention and the effects of those embodiments have been described above with reference to FIGS. 1 to 21. The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

The concepts of the present invention described above with reference to FIGS. 1 to 21 may be implemented as computer-readable code on a computer-readable medium. The computer-readable recording medium may be, for example, a removable recording medium (a CD, a DVD, a Blu-ray disc, a USB storage device, or a removable hard disk) or a fixed recording medium (a ROM, a RAM, or a hard disk built into a computer). The computer program recorded on the computer-readable recording medium may be transmitted to another computing device through a network such as the Internet and installed on that computing device, and may thereby be used on that computing device.

Although all the components constituting the embodiments of the present invention have been described above as being combined into one or operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated.

Although operations are shown in a specific order in the drawings, it should not be understood that the operations must be performed in the specific order shown or in sequential order, or that all illustrated operations must be performed, to obtain a desired result. In certain situations, multitasking and parallel processing may be advantageous. Moreover, the separation of various components in the embodiments described above should not be understood as requiring such separation, and it should be understood that the described program components and systems may generally be integrated together into a single software product or packaged into multiple software products.

Although the embodiments of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within a scope equivalent thereto should be interpreted as falling within the scope of rights of the present invention.

Claims

A data compression and transmission method performed by a computing device, the method comprising:
receiving original data composed of a set of records having a plurality of fields;
obtaining compressed data for the original data by removing duplicate field values for each field of the original data; and
transmitting the obtained compressed data,
wherein the transmitting comprises:
grouping field values located adjacent to each other on the original data to form a transmission group; and
dividing the compressed data into units of the transmission group and transmitting the divided compressed data.

The method of claim 1,
further comprising determining whether to perform compression based on a compression suitability of the received original data,
wherein the obtaining is performed in response to a determination to perform the compression.

The method of claim 2,
wherein determining whether to perform the compression comprises:
counting the number of fields having categorical data among the plurality of fields; and
determining whether to perform the compression based on the counted number of fields.

The method of claim 2,
wherein determining whether to perform the compression comprises:
calculating, for each field of the original data, the number of unique field values that do not overlap;
counting the number of fields, among the plurality of fields, in which the number of unique field values is less than a threshold value; and
determining whether to perform the compression based on the counted number of fields.

The method of claim 2,
wherein determining whether to perform the compression comprises:
sampling partial data from the original data; and
determining whether to perform the compression of the original data based on the compression suitability of the sampled partial data.

The method of claim 1,
wherein obtaining the compressed data comprises:
calculating, for each field of the original data, the number of unique field values that do not overlap;
selecting, as a compression target field, a field among the plurality of fields in which the number of unique field values is less than a threshold value; and
removing duplicate field values from the compression target field.

The method of claim 1,
wherein obtaining the compressed data comprises:
extracting a plurality of duplicate patterns for a first field of the original data by applying a sliding window to the first field;
calculating a first deduplication rate by performing first deduplication based on a first duplicate pattern among the plurality of duplicate patterns;
calculating a second deduplication rate by performing second deduplication based on a second duplicate pattern among the plurality of duplicate patterns; and
in response to determining that the first deduplication rate is higher than the second deduplication rate, determining the data resulting from the first deduplication as the compressed data of the first field.

The method of claim 7,
wherein extracting the plurality of duplicate patterns comprises
extracting the plurality of duplicate patterns using sliding windows of different sizes.

The method of claim 8,
wherein the first duplicate pattern includes two or more duplicate patterns selected from the plurality of duplicate patterns, and
at least some of the two or more duplicate patterns are patterns of different sizes.

According to claim 1,
The plurality of fields includes a first field without compression and a second field without compression,
The step of configuring the transmission group,
And configuring data of the first field and data of the second field into different transmission groups.
Data compression transfer method.

The method of claim 1,
wherein the dividing and transmitting comprises
transmitting, together with a first transmission group, location information indicating the location of the first transmission group on the original data.

The method of claim 1,
wherein a first transmission group among the configured transmission groups includes a plurality of field values belonging to the same field, and
wherein the dividing and transmitting comprises:
determining whether the plurality of field values match a single duplicate pattern;
in response to determining that they do not match, splitting the first transmission group into sub transmission groups based on the duplicate pattern; and
transmitting the first transmission group divided into units of the sub transmission groups.

The method of claim 1,
wherein the original data includes field separators for distinguishing the plurality of fields, and
wherein the dividing and transmitting comprises
transmitting the compressed data while excluding a first field separator located between transmission groups and a second field separator located between deduplicated field values.

A data compression and transmission device comprising:
a communication interface;
a memory containing one or more instructions; and
a processor that, by executing the one or more instructions,
receives original data composed of a set of records having a plurality of fields, obtains compressed data for the original data by removing duplicate field values for each field of the original data, and transmits the obtained compressed data through the communication interface,
wherein the processor
forms a transmission group by grouping field values located adjacent to each other on the original data, and transmits the compressed data divided into units of the transmission group.

A computer program stored in a computer-readable recording medium which, in combination with a computing device, executes the steps of:
receiving original data composed of a set of records having a plurality of fields;
obtaining compressed data for the original data by removing duplicate field values for each field of the original data; and
transmitting the obtained compressed data,
wherein the transmitting comprises:
grouping field values located adjacent to each other on the original data to form a transmission group; and
transmitting the compressed data divided into units of the transmission group.