KR20220027573A

KR20220027573A - Method and apparatus for compressing data based on deep learning

Info

Publication number: KR20220027573A
Application number: KR1020200108621A
Authority: KR
Inventors: 이성진; 김정균
Original assignee: 재단법인대구경북과학기술원
Priority date: 2020-08-27
Filing date: 2020-08-27
Publication date: 2022-03-08
Also published as: KR102500904B1

Abstract

The present invention relates to a method for compressing data and a device thereof. According to an embodiment of the present invention, a data compressing method comprises the following steps of: forming a plurality of categories made of similar data blocks for at least some data blocks in a data storage; allocating a label to each category; training a learning model for determining a category, to which a new data block belongs, by using labeled categories; receiving the new data block; selecting a suitable category for the new data block by using the learning model; and performing delta compression of the new data block based on a representative data block of the selected category.

Description

Deep learning-based data compression method and data compression device

The present invention relates to a data compression method and a data compression apparatus based on deep learning. More specifically, the present invention relates to a method and apparatus that form a plurality of clusters of similar data blocks using a delta compression ratio, analyze the type of a new data block through a learning model trained on these clusters, and perform delta compression against the most similar data block.

In modern society, individuals and sensors generate a large amount of digital data every day. Facebook reports that 100 TB of new data is uploaded daily, and YouTube reports that 48 hours of new video are stored every minute.

Various data reduction techniques have been proposed to reduce the amount of data stored in data centers and to achieve a better total cost of ownership (TCO). Among them, two methods, data compression and data deduplication, are widely used as effective ways of increasing storage capacity.

Data compression attempts to reduce the size of the data physically stored on disk by encoding the original data with lossless compression algorithms (e.g., LZ-based algorithms).

Data deduplication attempts to increase effective storage capacity by preventing duplicate blocks (blocks whose content is identical to that of other blocks) from being written to disk.

Although these two methods are effective to a certain extent, they have inherent limitations. First, data compression is ineffective at reducing the size of data that is already compressed and has high entropy, such as images and videos. Data deduplication does not depend on the entropy of the input data, but it is effective only when identical blocks are written frequently. That is, data deduplication may fail to remove redundant bit patterns even when the input block and the reference block (a block previously written to disk) differ by only a single bit.
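Both limitations can be observed with a few lines of Python. The following is a minimal sketch, not part of the disclosure: it assumes `zlib` as a stand-in for an LZ-based compressor, SHA-256 as a stand-in for a deduplication fingerprint, and random bytes as a stand-in for already-compressed content.

```python
import hashlib
import os
import zlib

BLOCK_SIZE = 4096  # 4 KB block, the block size used in this description


def fingerprint(block: bytes) -> str:
    """Content hash of the kind a deduplication index might use (assumption)."""
    return hashlib.sha256(block).hexdigest()


def flip_one_bit(block: bytes, bit_index: int = 0) -> bytes:
    """Return a copy of the block with a single bit inverted."""
    out = bytearray(block)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


# High-entropy data: random bytes stand in for already-compressed content.
random_block = os.urandom(BLOCK_SIZE)
compressed_len = len(zlib.compress(random_block))

# A near-duplicate block that differs from the reference by exactly one bit.
reference = bytes(range(256)) * (BLOCK_SIZE // 256)
near_duplicate = flip_one_bit(reference)
same_fingerprint = fingerprint(reference) == fingerprint(near_duplicate)

print("compressed size of random block:", compressed_len)
print("fingerprints match:", same_fingerprint)
```

The random block does not shrink under lossless compression, and the one-bit difference yields a different fingerprint, so an exact-match deduplication index would treat the near-duplicate as entirely new data.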

To overcome these limitations, a new data compression method that combines the advantages of both data compression and data deduplication is needed.

The background art described above is technical information that the inventors possessed for, or acquired in the course of, deriving the present invention, and it is not necessarily known art disclosed to the general public before the filing of the present application.

U.S. Patent No. 10,714,058 (2020.07.14)

One object of the present invention is to provide, as a new data compression method that combines the advantages of data compression and data deduplication, a method of finding the optimal reference data for the data to be compressed so that a higher delta compression ratio can be achieved.

Another object of the present invention is to provide a new delta compression algorithm that solves the problem of classical heuristic-based delta compression failing to find the similar data needed to raise the compression ratio.

Yet another object of the present invention is to provide a data compression apparatus and method that search for the optimal reference data to achieve the most effective delta compression ratio within limited computational resources and time, so that data compression can be performed quickly.

The problems to be solved by the present invention are not limited to those mentioned above. Other objects and advantages of the present invention that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present invention. It will also be appreciated that the objects and advantages of the present invention can be realized by the means indicated in the claims and combinations thereof.

A data compression method according to an embodiment of the present invention may be characterized by generating a deep learning-based learning model that determines which of the data types stored in a data storage is closest to newly input data, and by using the generated learning model so that the newly input data is delta-compressed based on optimal reference data.

A data compression method according to another embodiment of the present invention may include: forming a plurality of categories of similar data blocks for at least some data blocks in a data storage; allocating a label to each category; training, using the labeled categories, a learning model for determining the category to which a new data block belongs; receiving a new data block; selecting a category suitable for the new data block using the learning model; and performing delta compression of the new data block based on a representative data block of the selected category.

Here, forming the categories may include calculating a delta compression ratio between at least some of the data blocks in the data storage, and forming a plurality of categories of similar data blocks based on the calculated delta compression ratios.
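As an illustration of grouping by delta compression ratio, the sketch below uses a simple greedy rule. Everything in it is an assumption made for illustration: the ratio is taken to be the block size divided by the size of the `zlib`-compressed XOR delta, and `form_categories`, its `threshold` parameter, and the demo data are hypothetical. The clustering procedure actually used by the embodiment may differ.

```python
import random
import zlib
from typing import List


def delta_compression_ratio(block: bytes, reference: bytes) -> float:
    """Assumed definition: block size / size of the compressed XOR delta."""
    delta = bytes(a ^ b for a, b in zip(block, reference))
    return len(block) / len(zlib.compress(delta))


def form_categories(blocks: List[bytes], threshold: float) -> List[List[int]]:
    """Greedy grouping: a block joins the first category whose seed block
    yields a delta compression ratio of at least `threshold`."""
    categories: List[List[int]] = []
    for i, block in enumerate(blocks):
        for members in categories:
            if delta_compression_ratio(block, blocks[members[0]]) >= threshold:
                members.append(i)
                break
        else:  # no existing category was similar enough
            categories.append([i])
    return categories


# Demo data: two unrelated base blocks and slightly modified copies of them.
rng = random.Random(0)


def random_block(size: int = 1024) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(size))


def tweak(block: bytes, position: int) -> bytes:
    out = bytearray(block)
    out[position] ^= 0xFF
    return bytes(out)


base_a, base_b = random_block(), random_block()
blocks = [base_a, tweak(base_a, 10), base_b, tweak(base_b, 20), tweak(base_a, 500)]
categories = form_categories(blocks, threshold=4.0)
print(categories)
```

The greedy rule compares each block only with the first member of each category, which keeps the number of ratio computations small at the cost of depending on input order.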

In addition, allocating the labels may include selecting a representative data block in each category. Selecting the representative data block may include calculating, within each category, the delta compression ratios between the data block under evaluation and the other data blocks, and selecting the representative data block of each category based on the calculated delta compression ratios.
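One plausible selection rule is to pick the block that compresses best, on average, against every other member of its category. The sketch below assumes that rule; the function name `select_representative`, the ratio definition (block size over `zlib`-compressed XOR delta size), and the sample data are hypothetical and are not taken from the disclosure.

```python
import zlib
from typing import List


def delta_compression_ratio(block: bytes, reference: bytes) -> float:
    """Assumed definition: block size / size of the compressed XOR delta."""
    delta = bytes(a ^ b for a, b in zip(block, reference))
    return len(block) / len(zlib.compress(delta))


def select_representative(category: List[bytes]) -> int:
    """Return the index of the block with the highest mean delta
    compression ratio against all other blocks in the category."""
    if len(category) == 1:
        return 0

    def mean_ratio(i: int) -> float:
        others = [b for j, b in enumerate(category) if j != i]
        total = sum(delta_compression_ratio(b, category[i]) for b in others)
        return total / len(others)

    return max(range(len(category)), key=mean_ratio)


def variant(block: bytes, start: int, first_value: int) -> bytes:
    """Copy of `block` with 16 consecutive bytes altered from `start`."""
    out = bytearray(block)
    for k in range(16):
        out[start + k] ^= first_value + k
    return bytes(out)


# Sample category: one "central" block and three variants that each
# differ from it in a different 16-byte region.
center = bytes(i % 251 for i in range(512))
category = [
    variant(center, 0, 1),
    variant(center, 100, 17),
    center,
    variant(center, 300, 33),
]
representative_index = select_representative(category)
print(representative_index)
```

In this sample the central block is chosen because every variant differs from it in one region, while any two variants differ from each other in two regions.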

In addition, selecting a category suitable for the new data block may include: selecting, using the learning model, candidate categories expected to be suitable for the new data block; calculating the delta compression ratio between the new data block and the representative data block of each candidate category; and selecting, as the optimal category for the new data block, the category whose representative data block gives the highest delta compression ratio.

Here, the trained learning model may be configured so that, upon receiving a new data block, it outputs a list of candidate categories expected to be suitable for the new data block together with the probability that each candidate category is the optimal category.
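The two-stage selection described above (model-proposed candidates, then verification by delta compression ratio) can be sketched as follows. This is an illustrative sketch only: the model output is replaced by a hand-written probability dictionary, and `choose_category`, `top_k`, and the ratio definition are hypothetical names and assumptions.

```python
import random
import zlib
from typing import Dict


def delta_compression_ratio(block: bytes, reference: bytes) -> float:
    """Assumed definition: block size / size of the compressed XOR delta."""
    delta = bytes(a ^ b for a, b in zip(block, reference))
    return len(block) / len(zlib.compress(delta))


def choose_category(
    new_block: bytes,
    probabilities: Dict[str, float],
    representatives: Dict[str, bytes],
    top_k: int = 2,
) -> str:
    """Keep the `top_k` most probable categories, then return the one whose
    representative block gives the highest delta compression ratio."""
    candidates = sorted(probabilities, key=probabilities.get, reverse=True)[:top_k]
    return max(
        candidates,
        key=lambda label: delta_compression_ratio(new_block, representatives[label]),
    )


# Demo data: three unrelated representatives; the new block is a near-copy of B.
rng = random.Random(1)
representatives = {
    label: bytes(rng.getrandbits(8) for _ in range(1024)) for label in "ABC"
}
new_block = bytearray(representatives["B"])
new_block[42] ^= 0x0F
new_block = bytes(new_block)

# Stand-in for the model output; here the model slightly prefers the wrong label.
probabilities = {"A": 0.50, "B": 0.45, "C": 0.05}

best = choose_category(new_block, probabilities, representatives, top_k=2)
print(best)
```

Checking more than one candidate lets the ratio computation correct a model that ranks the best category second, while still limiting the number of representative blocks that must be read.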

More specifically, performing the delta compression may include performing an XOR operation on the representative data block and the new data block, and compressing the result of the XOR operation.
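The XOR-then-compress step, and its inverse, can be sketched in a few lines. The sketch assumes equal-sized blocks and uses `zlib` as the compressor of the XOR result; the disclosure does not name a specific compressor, and the function names are hypothetical.

```python
import random
import zlib


def delta_compress(new_block: bytes, reference: bytes) -> bytes:
    """XOR the new block with the reference block and compress the result."""
    if len(new_block) != len(reference):
        raise ValueError("blocks must have the same size")
    delta = bytes(a ^ b for a, b in zip(new_block, reference))
    return zlib.compress(delta)


def delta_decompress(payload: bytes, reference: bytes) -> bytes:
    """Recover the original block: decompress the delta and XOR it back."""
    delta = zlib.decompress(payload)
    return bytes(a ^ b for a, b in zip(delta, reference))


# Demo data: a high-entropy reference block and a new block that differs
# from it in only three bytes.
rng = random.Random(2)
reference = bytes(rng.getrandbits(8) for _ in range(4096))
new_block = bytearray(reference)
for position in (7, 1500, 4000):
    new_block[position] ^= 0x5A
new_block = bytes(new_block)

payload = delta_compress(new_block, reference)
restored = delta_decompress(payload, reference)
print(len(new_block), len(zlib.compress(new_block)), len(payload))
```

Because the XOR of two similar blocks is mostly zero bytes, the delta compresses well even though the new block itself is high-entropy and barely compressible on its own. Decompression requires the same reference block, so the reference must remain available for as long as the delta is stored.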

Meanwhile, a data compression method according to yet another embodiment of the present disclosure may further include, after a preset number of iterations of receiving new data, selecting a suitable category, and performing delta compression as described above, storing in a buffer cache the representative data block of the category selected most often during those iterations.
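The caching rule can be illustrated with a small counter-based sketch. The class `RepresentativeCache`, its `window` parameter (the preset number of compressions), and the choice to reset the counts after each window are assumptions made for illustration, not details given in the disclosure.

```python
from collections import Counter
from typing import Dict, Optional


class RepresentativeCache:
    """Counts which category is selected for each compressed block and, after
    every `window` selections, caches the most-selected category's
    representative block."""

    def __init__(self, representatives: Dict[str, bytes], window: int) -> None:
        self._representatives = representatives  # stands in for on-disk blocks
        self._window = window
        self._counts: Counter = Counter()
        self._seen = 0
        self.cached: Dict[str, bytes] = {}

    def record(self, label: str) -> None:
        """Register one category selection; update the cache at window end."""
        self._counts[label] += 1
        self._seen += 1
        if self._seen == self._window:
            top_label, _ = self._counts.most_common(1)[0]
            self.cached[top_label] = self._representatives[top_label]
            self._counts.clear()
            self._seen = 0

    def get(self, label: str) -> Optional[bytes]:
        """Return the cached representative, or None on a cache miss."""
        return self.cached.get(label)


representatives = {"A": b"a" * 8, "B": b"b" * 8, "C": b"c" * 8}
cache = RepresentativeCache(representatives, window=5)
for selected in ["B", "B", "A", "B", "C"]:
    cache.record(selected)
print(sorted(cache.cached))
```

A production cache would also need an eviction policy and a size limit; those are omitted here to keep the counting rule visible.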

In addition, other methods and other systems for implementing the present invention, and a computer-readable recording medium storing a computer program for executing the method, may be further provided.

Other aspects, features, and advantages beyond those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to the present invention, a higher delta compression ratio can be achieved for data newly input to a data storage, enabling more efficient operation of the data storage.

In addition, according to the present invention, the optimal reference data for delta compression can be found using deep learning rather than classical heuristics, so that overall data compression efficiency can be improved.

In addition, according to the present invention, the most effective delta compression ratio can be achieved within limited computational resources and time, so that data compression can be performed quickly and at low cost.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 schematically shows the overall steps of the delta compression mechanism.
FIG. 2 schematically illustrates a data compression apparatus according to an embodiment of the present disclosure.
FIG. 3 is a diagram giving an overall view of a data compression method according to an embodiment of the present disclosure.
FIG. 4 is a diagram for explaining the step of clustering and labeling the data blocks present in a data storage in a data compression method according to an embodiment of the present disclosure.
FIG. 5 is a diagram for explaining the learning phase of a data compression method according to an embodiment of the present disclosure.
FIG. 6 is a diagram for explaining the estimation phase of a data compression method according to an embodiment of the present disclosure.
FIG. 7 is a flowchart of a data compression method according to an embodiment of the present disclosure.
FIG. 8 is a flowchart of the labeling phase of a data compression method according to an embodiment of the present disclosure.
FIG. 9 is a flowchart of the estimation phase of a data compression method according to an embodiment of the present disclosure.
FIG. 10 is a diagram for explaining optimal delta compression, against which a data compression method according to an embodiment of the present disclosure is compared.
FIG. 11 shows a table comparing the efficiency of a data compression method according to an embodiment of the present disclosure with that of some existing compression methods.
FIG. 12 shows a table comparing the efficiency of a data compression method according to an embodiment of the present disclosure with that of other existing compression methods.
FIG. 13 shows a graph comparing the compression ratio of a data compression method according to an embodiment of the present disclosure with that of some existing compression methods.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments presented below. It may be implemented in various different forms and should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention. The embodiments presented below are provided so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of the scope of the invention. In describing the present invention, detailed descriptions of related known technology are omitted where they could obscure the gist of the present invention.

The terms used in the present application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. These terms are used only to distinguish one component from another.

Hereinafter, embodiments according to the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

FIG. 1 schematically shows the overall steps of the delta compression mechanism.

When a new data block (the input block) arrives, the delta compression mechanism first searches the disk for a reference block whose bit pattern is similar to that of the new block. It then performs an XOR operation between the reference block and the input block and compresses only the delta, that is, the result of the XOR operation. In this way, delta compression can provide a good compression ratio regardless of the entropy of the input data. In addition, even if the new block and the reference block do not match exactly, the redundant bit patterns present in both blocks can be removed efficiently.

The main technical challenge in delta compression is finding the optimal reference block. For example, a 1 TB SSD stores about 268 million 4 KB data blocks. Scanning all of these data blocks to identify the most similar one is practically impossible because of the extremely high overhead.
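The block count above follows from simple arithmetic, reproduced below. The sketch assumes binary units (1 TB taken as 2^40 bytes and 4 KB as 2^12 bytes); the per-comparison cost is a purely hypothetical figure included only to give a sense of scale.

```python
CAPACITY_BYTES = 2 ** 40   # 1 TB, interpreted as 2^40 bytes (assumption)
BLOCK_BYTES = 4 * 2 ** 10  # 4 KB block

num_blocks = CAPACITY_BYTES // BLOCK_BYTES
print(num_blocks)  # 268435456, i.e. about 268 million blocks

# Hypothetical cost model: one microsecond per block comparison.
ASSUMED_SECONDS_PER_COMPARISON = 1e-6
scan_seconds_per_input = num_blocks * ASSUMED_SECONDS_PER_COMPARISON
print(round(scan_seconds_per_input), "seconds to scan every block for one input")
```

Even under this optimistic assumption, a full scan would take minutes for every incoming 4 KB block, which is why an exhaustive search for the best reference is not practical.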

Previous studies proposed a heuristic-based approach that selects the reference block from a limited number of popular blocks. With this approach, however, degradation of the compression ratio is unavoidable. Another approach, which uses a locality-sensitive hash (LSH) function, improves compression efficiency, but its compression ratio is still markedly lower than the theoretically achievable optimum.

The present disclosure proposes a novel delta compression algorithm assisted by deep learning, referred to here, for convenience, as DeepComp.

In general, delta compression is known to provide a higher data compression ratio than conventional lossless compression (e.g., LZ-based algorithms) and deduplication algorithms because it depends less on input data patterns.

However, existing delta compression based on classical heuristics provides a compression ratio considerably lower than the optimum. It fails to find similar blocks on the disk and therefore, in most cases, performs worse than the optimal case.

The present disclosure explains that, by using recent deep learning algorithms to find the most similar data blocks, DeepComp can achieve a compression ratio close to the theoretical limit that optimal delta compression can provide.

The present disclosure proposes a novel deep learning-based compression technique called DeepComp. DeepComp is based on delta compression but differs fundamentally from existing delta compression schemes in that it uses state-of-the-art deep learning algorithms. The basic idea of DeepComp is borrowed from deep learning applications such as content-based image retrieval: given a new data block, DeepComp extracts the category of the block using a model pre-trained with a neural network (e.g., a CNN or FCN). DeepComp then selects a representative reference block (one previously written) belonging to that category and performs delta compression on the assumption that the new data block and the reference block have similar bit patterns.

Several technical challenges must be considered carefully when designing DeepComp. A central one is that, unlike existing neural networks designed and optimized for image and audio content, DeepComp must handle binary data (i.e., arrays of bits) directly, which raises the following two challenges.

First, unlike problems such as image classification, a neural network for judging data similarity has no well-defined categories (e.g., cat, dog, monkey). Instead, there is a very large number of possible combinations of bit patterns (for example, a 4 KB data block holds 32,768 bits and therefore has 2³²⁷⁶⁸ unique bit patterns). Accordingly, grouping blocks with similar bit patterns into the same category and labeling them becomes a major design issue.

Second, it is unclear whether existing deep learning algorithms will work well on binary data. Extensive evaluation is required to find a suitable neural network, and, if necessary, existing networks may have to be modified and refactored before they can be applied to binary data.
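To make the "array of bits" input concrete, the sketch below shows one possible way to turn a raw block into a fixed-length numeric vector that a neural network could consume. The bit-level encoding and the function name `block_to_bits` are assumptions made for illustration; the disclosure does not specify how blocks are encoded for the network.

```python
from typing import List


def block_to_bits(block: bytes) -> List[float]:
    """Expand each byte into eight 0.0/1.0 values, most significant bit first."""
    return [
        float((byte >> shift) & 1)
        for byte in block
        for shift in range(7, -1, -1)
    ]


sample = block_to_bits(b"\x80\x01")
print(sample)

# A 4 KB block (4,096 bytes) becomes a vector of 32,768 input values.
features = block_to_bits(bytes(4096))
print(len(features))
```

Other encodings, such as one input value per byte, would give a shorter vector; which representation suits a given network is exactly the kind of question the evaluation mentioned above would have to settle.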

The performance of DNN inference is another major challenge: in the estimation phase, DeepComp may require additional I/O as well as additional CPU cycles to perform delta compression.

The following sections present the DeepComp architecture and the strategies of the present disclosure for addressing the above problems, and briefly describe a plan for objectively evaluating the system developed in the present disclosure.

FIG. 2 schematically illustrates a data compression apparatus according to an embodiment of the present disclosure.

A data compression apparatus for performing DeepComp, the data compression method according to an embodiment of the present disclosure, may include a processor 100, a memory 110 connected to the processor 100, a cache memory 120, and a plurality of data storages 200. Here, a data storage may also be referred to as a disk.

The data compression apparatus may be deployed in a data center to control the operation of data storage devices and may compress the data blocks input to those data storage devices.

In this embodiment, the processor 100 may control the overall operation of the data compression apparatus. The processor may include any kind of device capable of processing data. Here, a "processor" may refer to a data processing device embedded in hardware that has physically structured circuitry for performing functions expressed as code or instructions contained in a program. Examples of such hardware-embedded data processing devices include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field-programmable gate array (FPGA), but the scope of the present invention is not limited thereto.

Here, the memory 110 may include magnetic storage media or flash storage media, but the scope of the present invention is not limited thereto. The memory 110 may include internal memory and/or external memory, and may include volatile memory such as DRAM, SRAM, or SDRAM; non-volatile memory such as one-time programmable ROM (OTPROM), PROM, EPROM, EEPROM, mask ROM, flash ROM, NAND flash memory, or NOR flash memory; a flash drive such as an SSD, a compact flash (CF) card, an SD card, a Micro-SD card, a Mini-SD card, an xD card, or a memory stick; or a storage device such as an HDD.

In addition, the memory 110 of the data compression apparatus according to an embodiment of the present disclosure may store programs related to the operation of the data compression apparatus, including instructions for performing the data compression method described below.

The instructions or programs stored in the memory 110 may cause the processor 100 to perform operations for controlling the data storage devices in accordance with the operating policy of the data center.

The cache memory 120 of the data compression apparatus according to an embodiment of the present disclosure may function as a buffer cache. Among the reference data blocks used as the basis for compressing data to be stored in the data storage, those that are accessed frequently may be stored in the cache memory 120.

The cache memory 120 stores representative data blocks by category label. Once the category to which new input data belongs has been determined by applying the learning model described below, the reference data block for the subsequent delta compression step can be loaded more quickly from the cache memory 120.

The data storage 200 may include a plurality of data storages (200a, 200b, … 200m). Because each data storage holds different data blocks, the labeling phase, learning phase, and estimation phase of the data compression method described below may be performed separately for each data storage.

FIG. 3 is a diagram giving an overall view of a data compression method according to an embodiment of the present disclosure.

FIG. 3 shows the overall architecture of DeepComp, the data compression method according to an embodiment of the present disclosure. To handle data blocks for delta compression, DeepComp may consist of three main phases: (1) labeling, (2) learning, and (3) estimation.

The labeling phase labels data blocks according to their bit patterns. During the labeling phase, DeepComp may measure the distance (similarity) between the data blocks stored on a disk or data storage and group (cluster or categorize) data blocks with similar bit patterns into the same category.

Accordingly, data blocks belonging to the same category (cluster) achieve a high compression ratio when compressed against one another. As shown in FIG. 3, the data blocks on the disk (or in the data storage) may be labeled with three categories, A, B, and C.

Using this information, DeepComp may run a supervised learning algorithm to produce a neural network learning model (for example, weights and biases suited to the purpose). The learning model produced in this way may be used in the subsequent estimation phase. As with other neural networks, DeepComp may improve the performance of its learning model by retraining when the overall accuracy drops noticeably.

Actual delta compression may be performed in the estimation phase. When a new data block is input, DeepComp may use the trained learning model to estimate the category to which the new data block belongs. The learning model may output a list of candidate categories for the new data block together with the probability of each category, or may generate a hash value that points to the nearest neighbors of the new data block.

In the example shown in FIG. 3, inputting the new data block into the trained learning model may yield a 95% probability that the block belongs to category B, a 4% probability for category C, and a 1% probability for category A.

Once the optimal category for the new data block (for example, category B) is selected, DeepComp reads the representative block of the selected category from the disk and uses it as the reference block. DeepComp may then perform delta compression on the new data block using the reference block and write the compressed data to the disk.

FIG. 4 is a diagram illustrating the step of clustering and labeling the data blocks present in a data storage in a data compression method according to an embodiment of the present disclosure.

The main responsibility of the labeling phase is to categorize (group or cluster) existing data blocks according to their similarity (see FIG. 3). In conventional deep learning algorithms, similarity is determined by the key features of the input data (for example, image or speech data). In DeepComp, by contrast, the similarity of two data blocks may be tied directly to their bit patterns. In one embodiment, the delta compression ratio (or entropy value) of the XOR of the two blocks may be used to determine the similarity between them.

The most direct labeling method is to measure the delta compression ratio for every possible combination of data blocks and to place a pair of data blocks in the same category if its delta compression ratio exceeds a given threshold, which may be called the delta distance. Some data blocks may not reach a delta compression ratio above the threshold with any other data block. In that case, DeepComp may create a new category for the data block, or may find, among the reference data blocks of the categories created so far, the one that yields the highest compression ratio with the data block and assign the data block to the category of that reference data block.

Repeating these steps automatically forms the categories or clusters to which the existing data blocks belong and fixes the number of possible categories. Each category is assigned a unique label, and the labeled categories are used in the subsequent training and estimation phases.
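As a non-limiting illustration, the threshold-based labeling described above may be sketched in Python as follows. The sketch makes several assumptions that are not part of the disclosure: zlib stands in for the unspecified compressor, the delta compression ratio is taken as the fraction of space saved, the first block of each category serves as its reference block, and the names `delta_ratio` and `label_blocks` are hypothetical. For brevity it compares each block only against the category reference blocks rather than against every other block.

```python
import zlib

def delta_ratio(a: bytes, b: bytes) -> float:
    """Assumed similarity measure: space saved when compressing a XOR b."""
    delta = bytes(x ^ y for x, y in zip(a, b))
    return max(0.0, 1.0 - len(zlib.compress(delta, 9)) / len(a))

def label_blocks(blocks, threshold=0.7):
    """Assign each block to the best-matching category, or open a new one."""
    references = []  # assumption: first block of a category is its reference
    labels = []
    for block in blocks:
        best_label, best_ratio = None, threshold
        for label, ref in enumerate(references):
            ratio = delta_ratio(block, ref)
            if ratio >= best_ratio:
                best_label, best_ratio = label, ratio
        if best_label is None:
            # No reference reaches the threshold: create a new category.
            references.append(block)
            best_label = len(references) - 1
        labels.append(best_label)
    return labels, references
```

The integer labels returned here play the role of the labels A, B, and C in the description; the number of categories falls out of the chosen threshold.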

Performing this clustering process on every data block on the disk or in the data storage may be the most accurate way to label the data blocks, but given its high overhead this approach (exhaustive clustering) may be difficult to implement. For example, with n data blocks, exhaustive clustering may require delta compression on the order of n² data block pairs. To reduce this burden, an approximate clustering algorithm may be considered for labeling.
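To give a sense of scale, the following short Python calculation counts the distinct unordered pairs, n(n-1)/2, which grows quadratically with n. The 1 TiB storage size is a hypothetical figure chosen only for illustration, and `pair_count` is a hypothetical helper name.

```python
TIB = 2 ** 40          # hypothetical storage capacity in bytes
BLOCK_SIZE = 4 * 1024  # 4 KB data blocks, as in the description

def pair_count(n: int) -> int:
    """Number of distinct unordered pairs among n blocks."""
    return n * (n - 1) // 2

n_blocks = TIB // BLOCK_SIZE
print(n_blocks)              # 268435456 blocks
print(pair_count(n_blocks))  # roughly 3.6e16 delta compressions
```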

In another embodiment, unsupervised learning techniques commonly used for clustering, such as k-means algorithms and variants of the self-organizing map, may be applied.

Another issue in labeling is identifying a suitable delta distance.

If the delta distance is set too small (that is, the delta compression ratio threshold is set too high), DeepComp will create too many categories, because even a small bit difference between data blocks forces a new category. Too many categories may become a burden in the learning phase.

Conversely, if the delta distance is set too large (that is, the delta compression ratio threshold is set too low), data blocks with substantially different bit patterns fall into the same category, and the overall compression ratio may degrade.

The delta distance or delta compression ratio threshold may be set to a value chosen in advance by the user, in which case the threshold determines the number of categories of data blocks in the data storage. Alternatively, the number of categories produced by clustering may be set in advance by the user, in which case the delta distance or delta compression ratio threshold is determined to match that number of categories.

Each data block in the data storage receives a label (for example, label A, label B, or label C) according to its category, and the labeled data blocks may serve as the training dataset for the subsequent supervised learning.

FIG. 5 is a diagram illustrating the learning phase of a data compression method according to an embodiment of the present disclosure.

Once the categories of the existing data blocks are determined and a label is assigned to each category, the next step is supervised learning of the neural network model.

Neural network models take various forms, and various models, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and feedforward artificial neural networks (feedforward ANNs), may be used in embodiments of the present disclosure.

The key consideration in the learning phase is how efficiently binary data can be handled. Modern deep neural networks have been designed mainly for image and speech recognition, not for binary data such as data blocks. Existing neural networks may therefore be ineffective at detecting similar bit patterns in binary data, and new adaptations may be required.

The first step is to explore various neural networks and measure how effectively each identifies similar bit patterns. For example, when evaluating a CNN algorithm, each 4 KB data block may be regarded as a 64 x 64 image with a single color channel (for example, grayscale), so that a CNN model can be applied directly to 4 KB data blocks as if they were images. In one embodiment, well-known CNN models such as ResNet, Inception, and YOLO may be used.
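By way of example only, the interpretation of a 4 KB data block as a 64 x 64 single-channel image may be sketched as below. The row-major layout, the scaling of byte values to the range 0 to 1, and the name `block_to_image` are assumptions made for illustration; the disclosure does not specify them.

```python
def block_to_image(block: bytes, side: int = 64):
    """Reshape a block of side*side bytes into a side x side grayscale grid.

    Each byte becomes one pixel intensity scaled to [0.0, 1.0] (assumed
    normalization); rows are filled in the order the bytes are stored.
    """
    if len(block) != side * side:
        raise ValueError("block length must equal side * side")
    return [
        [block[row * side + col] / 255.0 for col in range(side)]
        for row in range(side)
    ]

# A 4096-byte block yields 64 rows of 64 pixels each.
image = block_to_image(bytes(range(256)) * 16)
```

A grid of this shape could then be handed to an image classifier as a one-channel input.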

If the accuracy of the CNN algorithms is not sufficiently high, parts of the CNN algorithms may be optimized. For example, the preprocessing stage of the neural network may be strengthened so that the distinctive bit patterns of the data blocks are amplified. Such preprocessing may help the remaining layers, such as the convolutional layers and fully-connected (FC) layers, identify the distinctive bit patterns more reliably. The convolutional layers and FC layers may also be modified or refactored to take the internal processes of the delta compression algorithm into account.

In another embodiment, other classes of neural networks may be applied in place of a CNN. Among the many alternatives, FC layers may be used, and a multilayer perceptron (MLP), which belongs to the class of feedforward ANNs, may also be used. An MLP may consist mostly of deep fully-connected layers, and other neural networks of this kind may be applied.

Once the neural network learning model is set up, supervised learning may be performed using the training dataset prepared in FIG. 5. The training dataset contains data blocks as input values and the labels of the categories to which those data blocks belong as target values, and training on this dataset adjusts the parameters of the neural network learning model (for example, weights or biases).

A learning model trained as described above may estimate which category, for example A, B, or C, is appropriate for a subsequently input new data block.

FIG. 6 is a diagram illustrating the estimation phase of a data compression method according to an embodiment of the present disclosure.

Estimation in DeepComp, a data compression method according to an embodiment of the present disclosure, may map an input block to the most similar category on the disk or in the data storage. Using the pre-trained learning model described above, DeepComp may output not only a list of candidate categories but also the probability that each category is the optimal one.

DeepComp may select the single top-ranked category for delta compression, or may evaluate the top few categories to find a more suitable reference block. For each category, DeepComp may maintain a representative block that serves as the reference block. After identifying the optimal category for the input data block, DeepComp may read the reference block (that is, the representative block of the selected category), perform an XOR operation, compress the result of the XOR operation, and finally write the compressed data to the disk.
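The XOR-then-compress step described above, together with its inverse, may be sketched as follows. This is a minimal illustration that assumes zlib as the lossless compressor and equal-length blocks; the function names `delta_compress` and `delta_decompress` are hypothetical. It is included mainly to show that the original block can be recovered from the compressed delta and the same reference block.

```python
import zlib

def delta_compress(block: bytes, reference: bytes) -> bytes:
    """XOR the block with its reference block and compress the result."""
    if len(block) != len(reference):
        raise ValueError("block and reference must have the same length")
    delta = bytes(x ^ y for x, y in zip(block, reference))
    return zlib.compress(delta, 9)

def delta_decompress(payload: bytes, reference: bytes) -> bytes:
    """Invert delta_compress: decompress, then XOR with the same reference."""
    delta = zlib.decompress(payload)
    if len(delta) != len(reference):
        raise ValueError("payload does not match the reference length")
    return bytes(x ^ y for x, y in zip(delta, reference))
```

Because XOR is its own inverse, only the compressed delta and an identifier of the reference block need to be stored for each new block.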

To achieve better accuracy, learning-to-hash algorithms may be used. Unlike a conventional estimation algorithm, which outputs a ranking of candidate categories by probability, a learning-to-hash algorithm may generate a hash value representing the key features of a given data block. As in locality-sensitive hashing, this hash value may point to the nearest neighbors (or categories) among the existing data blocks, which makes it possible to find more similar reference blocks. However, learning-to-hash algorithms may need further modification to be properly optimized.

Finally, the labeling and learning phases occur infrequently unless a serious drop in the performance of the learning model is observed, and may run in the background when they do occur. The estimation phase, in contrast, directly affects overall system performance. It requires a fair number of CPU cycles to run the estimation algorithm as well as I/O to fetch reference blocks. To minimize this overhead, DeepComp may use a buffer cache that keeps frequently accessed reference blocks in DRAM. Commonly used cache replacement policies such as least recently used (LRU) and least frequently used (LFU) may be incorporated into the buffer cache to reduce the miss ratio. The estimation phase may also be optimized by using hardware accelerators (such as GPUs, TPUs, and NPUs) to reduce the computational cost of estimation.
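One possible form of such a buffer cache with an LRU replacement policy is sketched below. The class name `ReferenceBlockCache`, the `load_from_disk` callback, and the hit and miss counters are hypothetical and are introduced only for illustration; an LFU policy could be substituted without changing the interface.

```python
from collections import OrderedDict

class ReferenceBlockCache:
    """Keeps recently used reference blocks in memory, evicting in LRU order."""

    def __init__(self, capacity, load_from_disk):
        self._capacity = capacity
        self._load = load_from_disk   # hypothetical callback: label -> bytes
        self._blocks = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get(self, label):
        if label in self._blocks:
            self._blocks.move_to_end(label)  # mark as most recently used
            self.hits += 1
            return self._blocks[label]
        self.misses += 1
        block = self._load(label)            # disk I/O happens only on a miss
        self._blocks[label] = block
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict the least recently used
        return block
```

The miss ratio mentioned above corresponds to `misses / (hits + misses)` in this sketch.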

When a new input data block is fed to the trained learning model, the learning model may output the category to which the new input data block should belong, together with its probability, as in the example of FIG. 6.

If, as in the example of FIG. 6, the input block belongs to category B with a probability of 95%, the representative data block of category B is looked up in the representative block table, and delta compression of the input data block may be performed using that block as the reference block.

The compressed block produced by delta compression may be written to the disk. In this way, each new data block that is input can be stored on the disk efficiently.

FIG. 7 is a flowchart of a data compression method according to an embodiment of the present disclosure.

The data compression method shown in FIG. 7 may be performed by the processor 100 of the data compression apparatus shown in FIG. 2. First, the processor 100 may form categories (clusters) of similar data blocks from the data blocks in the data storage, using the methods described with reference to FIG. 4 (measuring the delta compression ratio between data blocks, or an unsupervised clustering algorithm such as k-means) (S100).

Data blocks in the same category are more similar to one another than to data blocks in other categories. In other words, highly similar data blocks are grouped into the same category.

Once the plurality of categories is formed, a label may be assigned to each category (S200). A label may be an arbitrary name such as A, B, or C. A learning model subsequently trained on the labeled categories may, when a new data block is input, output the name of the category judged most similar to the new data block and the similarity probability of that category.

Steps S100 and S200 may together be regarded as the labeling phase, an embodiment of which is described in more detail with reference to FIG. 8.

Next, a learning phase may be performed in which the learning model is trained using the labeled categories (S300). For example, the learning model may be trained by supervised learning on a training dataset in which the name of a labeled category is the target value and a data block belonging to that category is the input value.

Because both the labeling and learning phases work properly only when at least a certain amount of data blocks is stored in the data storage, the processor of the data compression apparatus may be configured to perform these operations only for a data storage that holds more than a preset amount of data.

After the learning model has been prepared to estimate the category of an input data block, a new data block may be input to the data storage (S400).

The trained learning model may estimate the category that best fits the input new data block (that is, the category with the highest similarity), and the category to which the new data block belongs may be selected based on this estimate (S500). Here, a category fits the new data block when it contains data blocks more similar to the new data block than other categories do. Put differently, the delta distance between the existing data blocks in that category and the new data block is small, and the delta compression ratio is large.

The category of the new data block may be selected by taking the category with the highest probability in the output of the learning model, or by using a specific method to select one optimal category from among the candidate categories whose probability is at or above a certain value.

The method of selecting one optimal category from among the candidate categories is described in more detail with reference to FIG. 9.

Once the optimal category for the new data block is selected, the processor 100 may perform delta compression on the new data block based on the representative data block of the selected category (S600).

The file resulting from the delta compression may be stored in the data storage (S700). Through this process, the content of the new data block is stored in the data storage in a smaller size than the new data block as actually input.

Although not shown in FIG. 7, the data compression method according to an embodiment of the present disclosure may further include, after performing new data input, category selection, and delta compression a preset number of times, storing the representative data block of the most frequently selected category in the cache memory 120 or the buffer cache.

For example, after 100 rounds of new data input, category selection, and delta compression, the representative data block of the category selected most often during those 100 rounds, or the representative data blocks of the five most frequently selected categories, may be stored in the cache memory 120 or the buffer cache.
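Choosing which representative data blocks to preload may be sketched as a simple frequency count over the recent selection history. The name `most_selected` and the use of a plain list of category labels as the history are assumptions made for illustration.

```python
from collections import Counter

def most_selected(selection_history, top_k=5):
    """Return the top_k category labels chosen most often in the history."""
    counts = Counter(selection_history)
    return [label for label, _ in counts.most_common(top_k)]

# Hypothetical history of 100 category selections.
history = ["B"] * 50 + ["A"] * 30 + ["C"] * 20
to_preload = most_selected(history, top_k=2)
```

The representative data blocks of the returned categories would then be copied into the cache memory 120 or the buffer cache.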

In this way, when a new data block is subsequently input, the reference block for delta compression can be retrieved more quickly.

FIG. 8 is a flowchart of the labeling phase of a data compression method according to an embodiment of the present disclosure.

In the labeling phase, the processor 100 may first select at least some of the data blocks in the data storage as the targets of clustering (S110). The processor 100 may cluster all data blocks in the data storage, or may cluster only the more frequently used data blocks, using a measure of usage frequency such as the access count.

Once the data blocks to be clustered are selected, the processor 100 may compute the delta compression ratios between the selected data blocks (S120). For data blocks a and b, for example, the delta compression ratio may be computed by first XORing data block a with data block b, compressing the XOR result, and then calculating the compression ratio. The delta compression ratio may take a value between 0 and 1; a higher ratio means the data compresses more, and a lower ratio means it compresses poorly.
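The computation just described may be sketched in Python as follows. The disclosure does not give a formula for the ratio, so this sketch assumes it is the fraction of space saved, clamped to the range 0 to 1, and uses zlib as a stand-in compressor; `delta_compression_ratio` is a hypothetical name.

```python
import zlib

def delta_compression_ratio(a: bytes, b: bytes) -> float:
    """Return a value in [0, 1]; higher means a XOR b compresses better."""
    if len(a) != len(b) or not a:
        raise ValueError("blocks must be non-empty and of equal length")
    # Step 1: XOR the two blocks; similar blocks yield mostly zero bytes.
    delta = bytes(x ^ y for x, y in zip(a, b))
    # Step 2: compress the XOR result.
    compressed = zlib.compress(delta, 9)
    # Step 3 (assumed definition): fraction of space saved, clamped at 0.
    return max(0.0, 1.0 - len(compressed) / len(a))
```

Under this definition, identical 4 KB blocks give a ratio close to 1, while unrelated blocks with random content give a ratio of about 0.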

Once the delta compression ratios between data blocks are computed, only those pairs of data blocks whose delta compression ratio is at or above a preset threshold (for example, 0.7) may be placed in the same category. In other words, the data blocks may be categorized based on the preset delta compression ratio threshold (S130).

Thereafter, the delta compression ratios between the data blocks within each category may be computed in order to find a representative data block for each category (S140).

The processor 100 may select the representative data block of each category based on the computed delta compression ratios. Selecting the representative data block of a category may include computing the delta compression ratios between a data block under evaluation as a candidate representative and the other data blocks in the category. For example, the data block that maximizes the sum of delta compression ratios with the other data blocks in its category may be selected as the representative data block of that category. In other words, the data block with the smallest sum of delta distances to the other data blocks in the category may be selected as the representative data block.
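The selection rule above, which picks the block with the largest summed delta compression ratio to the other blocks of its category, may be sketched as follows. The default `delta_ratio` (zlib-based, space saved) is an assumption, and `select_representative` is a hypothetical name; the similarity function is passed as a parameter so that another measure can be substituted.

```python
import zlib

def delta_ratio(a: bytes, b: bytes) -> float:
    """Assumed similarity measure: space saved when compressing a XOR b."""
    delta = bytes(x ^ y for x, y in zip(a, b))
    return max(0.0, 1.0 - len(zlib.compress(delta, 9)) / len(a))

def select_representative(blocks, ratio=delta_ratio):
    """Return the index of the block whose summed ratio to the others is largest."""
    if not blocks:
        raise ValueError("category must contain at least one block")
    best_index, best_total = 0, float("-inf")
    for i, candidate in enumerate(blocks):
        total = sum(
            ratio(candidate, other)
            for j, other in enumerate(blocks)
            if j != i
        )
        if total > best_total:
            best_index, best_total = i, total
    return best_index
```

This evaluates every pair within a category, so its cost grows quadratically with the category size.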

After the representative data block of each category is selected, a label may be assigned to each category (S160). However, label assignment and selection of the representative data blocks may be performed simultaneously, or in the reverse of the order shown in FIG. 8.

FIG. 9 is a flowchart of the estimation phase of a data compression method according to an embodiment of the present disclosure.

In one embodiment of the estimation phase, the processor 100 may use the trained learning model to generate a list of candidate categories to which the new data block may belong (S610). The candidate categories may initially be determined as the categories whose similarity probability is at or above a preset probability value.

For example, if the preset probability value is 40% and applying the learning model to the new data block estimates similarity probabilities of 50%, 45%, and 5% for categories A, B, and C respectively, the candidate categories may be determined as A and B.

The processor 100 may compute the delta compression ratios between the new data block and the representative data blocks of the candidate categories (S620). In the example above, it computes the delta compression ratio between the representative data block of category A and the new data block, and the delta compression ratio between the representative data block of category B and the new data block.

The processor 100 may determine that the new data block belongs to the category whose representative data block yields the highest computed delta compression ratio (S630). For example, if the delta compression ratio between the new data block and the representative data block of category A is 0.7 and that between the new data block and the representative data block of category B is 0.8, category B is determined to be the optimal category for the new data block even though category A had the higher similarity probability.
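Steps S610 to S630 may be sketched together as follows. The model output is represented here by a plain dictionary of probabilities, the zlib-based `delta_ratio` is an assumed similarity measure, and `choose_category` is a hypothetical name. The fallback to the most probable category when no category reaches the preset probability is also an assumption, since the disclosure does not address that case.

```python
import zlib

def delta_ratio(a: bytes, b: bytes) -> float:
    """Assumed similarity measure: space saved when compressing a XOR b."""
    delta = bytes(x ^ y for x, y in zip(a, b))
    return max(0.0, 1.0 - len(zlib.compress(delta, 9)) / len(a))

def choose_category(probabilities, representatives, block,
                    min_probability=0.4, ratio=delta_ratio):
    """Pick the candidate whose representative block compresses best with block."""
    # S610: keep categories at or above the preset probability value.
    candidates = [c for c, p in probabilities.items() if p >= min_probability]
    if not candidates:
        # Assumed fallback: use the single most probable category.
        candidates = [max(probabilities, key=probabilities.get)]
    # S620 and S630: compare delta compression ratios and take the highest.
    return max(candidates, key=lambda c: ratio(block, representatives[c]))
```

With probabilities of 50%, 45%, and 5% for A, B, and C, only A and B are compared, and B is returned whenever its representative block yields the higher ratio.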

Thereafter, the processor 100 may perform delta compression of the new data block based on the representative data block of category B (S640), thereby compressing the new data block with the highest delta compression ratio among the candidates.

FIG. 10 is a diagram illustrating optimal delta compression, against which a data compression method according to an embodiment of the present disclosure is compared.

To understand the potential of DeepComp to reduce data size according to an embodiment of the present disclosure, experiments may be performed using workloads collected from seven different application scenarios. DeepComp was compared with three existing data reduction techniques: optimal delta compression (referred to as Optimal), lossless compression (referred to as Comp), and lossless compression plus deduplication (referred to as Comp+Dedup). DeepComp was additionally compared with two delta compression techniques: deduplication-assisted compression (DAC) and locality-sensitive hashing (LSH).

Optimal denotes the optimal delta compression algorithm. For a given 4 KB input block, Optimal performs an exhaustive search over all data blocks on the disk to find the 4 KB block most similar to the input block. For example, if the disk stores n 4 KB data blocks, Optimal measures the entropy values between the input block and each of the n data blocks. The block with the lowest entropy is selected as the reference block, and delta compression is then performed on the input block against the reference block. Optimal may also compress the input block using a lossless compression algorithm (for example, XMatch-Pro). Of the two results, the delta-compressed one and the losslessly compressed one, the smaller is finally written to the disk. Optimal delta compression is not applicable in practice because it requires an enormous amount of disk I/O and CPU computation (that is, at least n block reads and n entropy measurements). Its results are nevertheless meaningful because they provide a theoretical upper bound on what delta compression can achieve.

Referring to FIG. 10, five data blocks (4 KB each) are stored on an example disk. When a new input block arrives, Optimal measures entropy values against all of the blocks (that is, 'A', 'B', 'C', 'D', and 'E'). To calculate an entropy value, an XOR operation is first performed between the input block and one of the existing data blocks to be compared (for example, 'A'), and the entropy of the data resulting from the XOR operation is then calculated according to Shannon's definition. Optimal repeats these steps for all the other data blocks; if, for example, block 'E' shows the lowest entropy value, block 'E' is selected as the reference block.
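The entropy calculation described above can be written out as follows. This is a sketch that assumes the XOR result is read as a sequence of bytes (the text does not specify the symbol size). The symbols are notation introduced here for illustration: X is the input block, B_j is the j-th stored block, N is the block length in bytes, and terms with p_v = 0 are taken to contribute zero.

```latex
D_j = X \oplus B_j, \qquad
p_v = \frac{\left|\{\, i : D_j[i] = v \,\}\right|}{N}, \qquad
H(D_j) = -\sum_{v=0}^{255} p_v \log_2 p_v, \qquad
j^{*} = \arg\min_{j} H(D_j)
```

A block identical to the input gives an all-zero XOR result and therefore H = 0, while an unrelated block gives a result close to the maximum of 8 bits per byte.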

By performing an XOR operation between the input block and every data block and measuring the entropy, Optimal can find the best reference block in the data storage for the input block. However, comparing against every data block requires considerable computational resources and time, so it is very difficult to implement in practical situations.
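The exhaustive search performed by Optimal can be sketched as below. The sketch assumes byte-level symbols for the Shannon entropy and keeps the stored blocks in an in-memory dictionary instead of reading them from disk; the function names are hypothetical.

```python
import math
from collections import Counter


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (byte-level symbols are an assumption)."""
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def find_optimal_reference(input_block: bytes, stored_blocks: dict) -> str:
    """Score every stored block and return the name of the lowest-entropy one.

    The loop touches all n blocks, which is exactly why the approach is
    impractical for a real disk.
    """
    return min(
        stored_blocks,
        key=lambda name: shannon_entropy(xor_blocks(input_block, stored_blocks[name])),
    )
```

For the five-block disk of FIG. 10, `stored_blocks` would hold the entries 'A' through 'E', and the function would return 'E' whenever the XOR with block 'E' has the lowest entropy.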

FIG. 11 shows a table comparing the efficiency of a data compression method according to an embodiment of the present disclosure with that of some existing compression methods.

Comp performs lossless compression on every 4 KB input block. In one example, the XMatch-Pro algorithm was chosen for lossless compression because it shows a good compression ratio and can be easily implemented in hardware (for example, ASICs and FPGAs).

Comp+Dedup is similar to Comp, but differs in that it performs data deduplication before applying lossless compression. Comp+Dedup internally maintains a fingerprint store that holds the fingerprints of the 4 KB data blocks previously written to disk. Here, the fingerprint store is assumed to be large enough to hold the fingerprints of all existing data blocks. When a new input block arrives, Comp+Dedup computes the fingerprint of the input block and searches the fingerprint store for a matching fingerprint. If a match is found, Comp+Dedup does not write the input block, because the same block is already stored on disk. If no match is found, Comp+Dedup compresses the input block using XMatch-Pro and stores it on disk. The new fingerprint is added to the fingerprint store for future deduplication.
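The Comp+Dedup write path can be sketched as follows. Two substitutions are assumptions made for illustration: SHA-256 is used as the fingerprint function (the text does not name one), and zlib stands in for XMatch-Pro. The class name and its in-memory "disk" list are hypothetical.

```python
import hashlib
import zlib


class CompDedupStore:
    """Toy model of a fingerprint store plus compressed block storage."""

    def __init__(self):
        self.fingerprints = {}  # fingerprint -> index of the stored block
        self.disk = []          # compressed blocks standing in for the disk

    def write(self, block: bytes) -> bool:
        """Return True if the block was written, False if it was deduplicated."""
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint in self.fingerprints:
            # An identical block is already stored, so nothing is written.
            return False
        # zlib is a stand-in for the lossless compressor named in the text.
        self.disk.append(zlib.compress(block))
        # Remember the new fingerprint for future deduplication.
        self.fingerprints[fingerprint] = len(self.disk) - 1
        return True
```

Writing the same 4 KB block twice stores it once: the first call returns True and the second returns False.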

Existing data reduction techniques were compared with an initial prototype of DeepComp. For DeepComp of the present disclosure, an unsupervised-learning-based clustering algorithm for 4 KB data blocks was used to analyze the bit patterns of the blocks and to group similar data blocks into the same category. For each group, DeepComp assigns a representative block that contains the common bit patterns of that group.

For a given new input block, DeepComp can find the most similar block by measuring the entropy values between the input block and each of k (k<n) representative blocks. The representative block with the lowest entropy value may be selected as the reference block for delta compression. Here, k is any number smaller than n, and the k representative blocks may be the representative blocks of the k categories most similar to the input block.
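The restriction to k representative blocks can be sketched as below. The sketch assumes that the learning model has already produced a probability per category (passed in as a plain dictionary) and that entropy is measured over bytes; the function names and data layout are hypothetical.

```python
import math
from collections import Counter


def xor_entropy(a: bytes, b: bytes) -> float:
    """Shannon entropy (bits per byte) of the XOR of two blocks."""
    delta = bytes(x ^ y for x, y in zip(a, b))
    counts = Counter(delta)
    return -sum((c / len(delta)) * math.log2(c / len(delta)) for c in counts.values())


def pick_reference(input_block: bytes, probabilities: dict, representatives: dict, k: int = 3) -> str:
    """Measure entropy only against the k most probable categories.

    probabilities: {category: model probability} (assumed model output)
    representatives: {category: representative block}
    """
    # Keep the k categories the model considers most similar to the input.
    top_k = sorted(probabilities, key=probabilities.get, reverse=True)[:k]
    # Among those, the lowest-entropy representative becomes the reference.
    return min(top_k, key=lambda cat: xor_entropy(input_block, representatives[cat]))
```

Compared with the exhaustive search, this performs k entropy measurements instead of n, at the cost of missing a good reference whenever its category falls outside the top k.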

FIG. 11 shows experimental results comparing the compression ratios of DeepComp and other data reduction schemes in various computing environments (Install, PC, Web, Synth, Update, Dedup, VM, and so on).

Optimal, which represents the theoretical performance limit, shows the best compression ratio and can compress data to 24.2% of the original data on average. Comp and Comp+Dedup provide data size reductions of 46.1% and 36.5%, respectively, and thus lower compression ratios.

DeepComp compresses data to 28.4% of the original size on average, a compression ratio close to that of Optimal. Considering that Optimal is a theoretical figure that cannot be applied in a real environment, whereas DeepComp can be used even when computational resources and computation time are limited, the results of FIG. 11 show that the DeepComp scheme of the present disclosure provides a very effective compression solution.

FIG. 12 shows a table comparing the efficiency of a data compression method according to an embodiment of the present disclosure with that of other existing compression methods.

In FIG. 12, DAC is a delta compression algorithm designed to reduce the I/O and CPU overhead of finding similar reference blocks. DAC uses a simple heuristic to check for similar blocks: it keeps four fingerprints, one for each 1 KB data chunk of a 4 KB data block, and uses them to check the similarity between blocks. As in Comp+Dedup, the fingerprints are stored in a fingerprint store. DAC extracts fingerprints from the input block and searches the fingerprint store for similar fingerprints. If a block with three or more identical fingerprints exists, DAC judges that block to be sufficiently similar and uses it as the reference block. If there is no similar block, DAC performs lossless compression on the input data block using XMatch-Pro.
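The DAC heuristic can be sketched as follows. Several details are assumptions for illustration: SHA-1 is used as the chunk fingerprint, fingerprints are compared position by position, and the store is scanned linearly rather than indexed. The function names and store layout are hypothetical.

```python
import hashlib

CHUNK_SIZE = 1024  # 1 KB chunks, four per 4 KB block


def chunk_fingerprints(block: bytes) -> list:
    """One fingerprint per 1 KB chunk of the block."""
    return [
        hashlib.sha1(block[i:i + CHUNK_SIZE]).digest()
        for i in range(0, len(block), CHUNK_SIZE)
    ]


def find_dac_reference(input_block: bytes, fingerprint_store: dict, threshold: int = 3):
    """Return the id of a stored block sharing at least `threshold` fingerprints.

    fingerprint_store maps a block id to its list of chunk fingerprints.
    Returns None when no block is similar enough, in which case the caller
    would fall back to plain lossless compression.
    """
    input_fps = chunk_fingerprints(input_block)
    for block_id, stored_fps in fingerprint_store.items():
        matches = sum(1 for a, b in zip(input_fps, stored_fps) if a == b)
        if matches >= threshold:
            return block_id
    return None
```

With the default threshold, a block that differs from a stored block in a single 1 KB chunk is accepted as a reference, while one that differs in two chunks is rejected.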

LSH is based on DAC but uses locality-sensitive hashing instead of fingerprints. For each block, LSH computes a hash value using the MinHash algorithm and stores that value in a MinHash store. MinHash has the distinctive property of generating similar hash numbers for similar 4 KB data blocks. When a new input block arrives, LSH calculates the hash value of the input block and searches the MinHash store for the block with the closest hash value. When that block is found, it is used as the reference block.
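A MinHash-based lookup can be sketched as below. The text speaks of a single hash value per block; this sketch instead uses the common multi-hash signature form of MinHash, which is an assumption, as are the 8-byte shingles, the salted BLAKE2b hash family, and all function names. Blocks are assumed to be at least as long as the shingle width.

```python
import hashlib


def shingles(block: bytes, width: int = 8) -> set:
    """Set of all overlapping byte substrings of the given width."""
    return {block[i:i + width] for i in range(len(block) - width + 1)}


def minhash_signature(block: bytes, num_hashes: int = 32) -> list:
    """For each salted hash function, keep the minimum hash over all shingles."""
    shingle_set = shingles(block)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a different salt gives a different hash function
        signature.append(
            min(hashlib.blake2b(s, digest_size=8, salt=salt).digest() for s in shingle_set)
        )
    return signature


def estimated_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def find_lsh_reference(input_block: bytes, minhash_store: dict) -> str:
    """Return the id of the stored block whose signature is closest to the input's."""
    signature = minhash_signature(input_block)
    return max(minhash_store, key=lambda bid: estimated_similarity(signature, minhash_store[bid]))
```

Here "closest hash value" is interpreted as the highest number of matching signature positions; other distance measures over the stored hashes would fit the description equally well.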

FIG. 12 shows the results of comparing DeepComp with DAC and LSH on three workloads: PC, Synth, and Web. Except on Web, DAC and LSH do not show high compression ratios. DeepComp shows a higher compression ratio than DAC and LSH in all cases.

FIG. 13 shows a graph comparing the compression ratio of a data compression method according to an embodiment of the present disclosure with those of some existing compression methods.

FIG. 13 shows the data size ratios after compression with DAC, LSH, and DeepComp. FIG. 13 is normalized to Comp+Dedup, and it can be seen that, even compared with DAC and LSH, DeepComp further reduces the size of the original data by 8.48% to 13.83%.

The DeepComp proposed in the present disclosure may be implemented on various platforms. In one example, DeepComp may be implemented in an All-Flash Array (AFA) system having multiple SSDs. An AFA system uses x86-based CPUs together with high-performance GPUs, and this rich hardware specification can allow the entire algorithm, including all of the labeling, training, and inference phases, to run in a single box.

If the inference phase is fast enough, DeepComp may also support online compression. Because DeepComp reduces the amount of physical data written to disk, online compression can give an additional performance gain. If the inference phase incurs non-negligible overhead, offline compression, which compresses previously written data blocks in the background, may be preferred. When offline compression is performed, the overall compression ratio will be higher.

The embodiments according to the present invention described above may be implemented in the form of a computer program executable through various components on a computer, and such a computer program may be recorded on a non-transitory computer-readable medium. The medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Meanwhile, the computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the computer software field. Examples of the computer program include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

In the specification of the present invention (particularly in the claims), the use of the term "the" and similar referential terms may correspond to both the singular and the plural. In addition, when a range is described in the present invention, it includes the invention to which individual values belonging to that range are applied (unless stated otherwise), and is equivalent to describing each individual value constituting the range in the detailed description of the invention.

Unless an order is explicitly stated for the steps constituting the method according to the present invention, or unless stated to the contrary, the steps may be performed in any suitable order. The present invention is not necessarily limited by the order in which the steps are described. The use of any examples or exemplary terms (for example, "and the like") in the present invention is merely to describe the present invention in detail, and the scope of the present invention is not limited by those examples or exemplary terms unless limited by the claims. In addition, those skilled in the art will appreciate that various modifications, combinations, and changes may be made according to design conditions and factors within the scope of the appended claims or their equivalents.

Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set forth below but also all scopes equivalent to, or equivalently modified from, those claims fall within the scope of the spirit of the present invention.

Claims

A method of compressing data, comprising:
forming a plurality of categories, each consisting of similar data blocks, for at least some data blocks in a data storage;
assigning a label to each category;
training, using the labeled categories, a learning model for determining the category to which a new data block belongs;
receiving a new data block;
selecting a category suitable for the new data block using the learning model; and
performing delta compression of the new data block based on a representative data block of the selected category.

The method of claim 1,
wherein forming the categories comprises:
calculating a delta compression ratio between at least some data blocks in the data storage; and
forming a plurality of categories consisting of similar data blocks based on the calculated delta compression ratio.

The method of claim 1,
wherein assigning the label comprises selecting a representative data block from each category, and
wherein selecting the representative data block comprises:
calculating delta compression ratios that a data block to be evaluated in each category has with the other data blocks; and
selecting a representative data block in each category based on the calculated delta compression ratios.

The method of claim 1,
wherein selecting a category suitable for the new data block comprises:
selecting, using the learning model, candidate categories expected to be suitable for the new data block;
calculating a delta compression ratio between the new data block and the representative data block of each of the candidate categories; and
selecting, as an optimal category for the new data block, the category to which the representative data block having the highest delta compression ratio belongs.

The method of claim 1,
wherein the trained learning model is configured, upon receiving the new data block, to output a list of candidate categories expected to be suitable for the new data block and a probability that each candidate category is the optimal category.

The method of claim 1,
wherein performing the delta compression comprises:
performing an XOR operation on the representative data block and the new data block; and
compressing the result of the XOR operation.

The method of claim 1,
further comprising, after a preset number of new data blocks have been input and the selecting of a suitable category and the delta compression have been executed for them,
storing, in a buffer cache, the representative data block of the category selected most often during the execution.

A non-transitory computer-readable medium having a computer program stored thereon,
wherein the computer program, when executed on a computer, is configured to cause the computer to perform the method according to any one of claims 1 to 7.

A data compression device for controlling a data storage, comprising:
a memory; and
at least one processor coupled to the memory and configured to execute computer-readable instructions contained in the memory,
wherein the at least one processor is configured to perform operations of:
forming a plurality of categories, each consisting of similar data blocks, for at least some data blocks in the data storage;
assigning a label to each category;
training, using the labeled categories, a learning model for determining the category to which a new data block belongs;
receiving a new data block as input;
selecting a category suitable for the new data block using the learning model; and
performing delta compression of the new data block based on a representative data block of the selected category.

The device of claim 9,
wherein the operation of forming the categories comprises:
calculating a delta compression ratio between at least some data blocks in the data storage; and
forming a plurality of categories consisting of similar data blocks based on the calculated delta compression ratio.

The device of claim 9,
wherein the operation of assigning the label comprises selecting a representative data block in each category, and
wherein the operation of selecting the representative data block comprises:
calculating delta compression ratios that a data block to be evaluated in each category has with the other data blocks; and
selecting a representative data block in each category based on the calculated delta compression ratios.

The device of claim 9,
wherein the operation of selecting a category suitable for the new data block comprises:
selecting, using the learning model, candidate categories expected to be suitable for the new data block;
calculating a delta compression ratio between the new data block and the representative data block of each of the candidate categories; and
selecting, as an optimal category for the new data block, the category to which the representative data block having the highest delta compression ratio belongs.

The device of claim 9,
wherein the trained learning model is configured, upon receiving the new data block, to output a list of candidate categories expected to be suitable for the new data block and a probability that each candidate category is the optimal category.

The device of claim 9,
wherein the operation of performing the delta compression comprises:
performing an XOR operation on the representative data block and the new data block; and
compressing the result of the XOR operation.

The device of claim 9,
wherein the at least one processor is further configured to perform, after a preset number of new data blocks have been input and the selecting of a suitable category and the delta compression have been executed for them,
an operation of storing, in a buffer cache, the representative data block of the category selected most often during the execution.
