KR20150035876A

KR20150035876A - Method for de-duplicating data and apparatus therefor

Info

Publication number: KR20150035876A
Application number: KR20150026018A
Authority: KR
Inventors: 박찬익; 박세진
Original assignee: 포항공과대학교 산학협력단
Priority date: 2015-02-24
Filing date: 2015-02-24
Publication date: 2015-04-07

Abstract

Disclosed are a method for eliminating data duplication and an apparatus therefor. The method for eliminating data duplication comprises the steps of: acquiring the data access attributes of data based on a request of data input or data output; determining a unit for eliminating data duplication based on the access attributes; and performing data deduplication based on the unit for eliminating data duplication. Accordingly, it is possible to provide a low input/output latency.

Description

METHOD FOR DE-DUPLICATING DATA AND APPARATUS THEREFOR

The present invention relates to data deduplication, and more particularly to a data deduplication method and apparatus that provide low input/output latency.

Data deduplication refers to techniques that free up storage space by removing duplicate data from a data storage device. Many companies and public institutions now back up their data periodically to keep it safe. Backup data is by nature highly redundant, so deduplication is used to improve the storage efficiency of backup data. Because backup storage devices do not require low input/output latency, deduplication techniques have developed mainly around raising the deduplication rate.

However, because such deduplication techniques rely on complex deduplication algorithms, they are difficult to apply to portable terminals such as notebooks, smartphones, and tablet PCs. Specifically, when deduplication is applied to such a portable terminal, the physical order of sequentially stored data changes, which severely slows data input and output.

To solve the above problems, an object of the present invention is to provide a data deduplication method that removes duplicate data based on a deduplication rate determined adaptively according to the input/output characteristics of the data.

To solve the above problems, another object of the present invention is to provide a data deduplication apparatus that removes duplicate data based on a deduplication rate determined adaptively according to the input/output characteristics of the data.

To achieve the above object, a data deduplication method according to an embodiment of the present invention includes: acquiring access characteristics of data based on an input request or an output request for the data; determining a deduplication unit for the data based on the access characteristics; and performing deduplication on the data based on the deduplication unit.

Here, the access characteristics may include at least one of an access time for the data, a modification time for the data, a sequential access count for the data, and a random access count for the data.

Here, acquiring the access characteristics may include: when an input request for the data is received, acquiring the access time and the modification time based on time information of the input request; and acquiring the sequential access count or the random access count based on the continuity of the input request.

Here, acquiring the access characteristics may include: when an output request for the data is received, acquiring the access time based on time information of the output request; and acquiring the sequential access count or the random access count based on the continuity of the output request.

Here, determining the deduplication unit may include: calculating a first difference between the current access time of the data and the previous modification time of the data; when the first difference is less than or equal to a predefined first time, determining the deduplication unit to be a fourth deduplication unit, which has the lowest deduplication likelihood for the data; when the first difference exceeds the predefined first time, calculating a second difference between the current access time of the data and the previous access time of the data; when the second difference exceeds a predefined second time, determining the deduplication unit to be a first deduplication unit, which has the highest deduplication likelihood for the data; when the second difference is less than or equal to the predefined second time and the random access count for the data is greater than or equal to the sequential access count, determining the deduplication unit to be a second deduplication unit, which has a lower deduplication likelihood than the first deduplication unit; and when the second difference is less than or equal to the predefined second time and the random access count for the data is less than the sequential access count, determining the deduplication unit to be a third deduplication unit, which has a lower deduplication likelihood than the second deduplication unit.

Here, the fourth deduplication unit may mean that no deduplication is performed on the data.
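The unit-selection rule described above can be sketched as a small decision function. This is only an illustrative sketch: the names `DedupUnit`, `AccessInfo`, and `choose_dedup_unit`, the use of seconds as the time unit, and the threshold parameters `first_time` and `second_time` are assumptions made for the example and are not taken from the specification.

```python
from dataclasses import dataclass
from enum import IntEnum


class DedupUnit(IntEnum):
    """Hypothetical labels for the four deduplication units."""
    FIRST = 1   # highest deduplication likelihood
    SECOND = 2
    THIRD = 3
    FOURTH = 4  # lowest likelihood; may mean "do not deduplicate"


@dataclass
class AccessInfo:
    """Previously stored access characteristics (field names assumed)."""
    a_time: float    # previous access time, in seconds
    m_time: float    # previous modification time, in seconds
    seq_count: int   # cumulative sequential access count
    rand_count: int  # cumulative random access count


def choose_dedup_unit(info: AccessInfo, now: float,
                      first_time: float, second_time: float) -> DedupUnit:
    # First difference: current access time minus previous modification time.
    first_diff = now - info.m_time
    if first_diff <= first_time:
        # Recently modified data: lowest deduplication likelihood.
        return DedupUnit.FOURTH

    # Second difference: current access time minus previous access time.
    second_diff = now - info.a_time
    if second_diff > second_time:
        # Not accessed for a long time: highest deduplication likelihood.
        return DedupUnit.FIRST

    # Recently accessed data: compare random and sequential access counts.
    if info.rand_count >= info.seq_count:
        return DedupUnit.SECOND
    return DedupUnit.THIRD
```

The function has to be called before the stored `a_time` and `m_time` are overwritten with the current request time, since both differences are computed against the previous values.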

Here, performing deduplication on the data may include: generating at least one data block from the data based on the deduplication unit; generating a unique identifier for the data block; determining whether the unique identifier exists in an index table; when the unique identifier exists in the index table, removing the data block corresponding to the unique identifier; and when the unique identifier does not exist in the index table, storing the unique identifier together with the data block corresponding to it.

Here, generating the unique identifier may use a hash algorithm to generate the unique identifier for the data block.
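As a rough illustration of the block-level deduplication just described, the sketch below splits data into blocks, hashes each block, and consults an index table. Several details are assumptions made for the example rather than statements from the specification: fixed-size splitting, SHA-256 as the hash algorithm, a Python `dict` as the index table, and the returned list of identifiers (`recipe`), which is added only so that the original data can be reassembled.

```python
import hashlib
from typing import Dict, List


def deduplicate(data: bytes, unit_size: int,
                index: Dict[str, bytes]) -> List[str]:
    """Store each distinct block once and return the block identifiers in order."""
    recipe: List[str] = []
    for offset in range(0, len(data), unit_size):
        # Generate a data block based on the deduplication unit (assumed fixed size).
        block = data[offset:offset + unit_size]
        # Generate a unique identifier for the block with a hash algorithm.
        identifier = hashlib.sha256(block).hexdigest()
        if identifier not in index:
            # Identifier not in the index table: store identifier and block.
            index[identifier] = block
        # Identifier already present: the duplicate block is simply not stored.
        recipe.append(identifier)
    return recipe


index_table: Dict[str, bytes] = {}
original = b"AAAABBBBAAAA"
recipe = deduplicate(original, unit_size=4, index=index_table)
restored = b"".join(index_table[i] for i in recipe)
```

In this usage example the input contains three 4-byte blocks, two of which are identical, so only two blocks end up in `index_table` while `restored` equals `original`.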

Here, the deduplication unit may be classified into at least one deduplication unit based on the deduplication likelihood of the data.

To achieve the other object, a data deduplication apparatus according to an embodiment of the present invention includes: a processing unit that acquires access characteristics of data based on an input request or an output request for the data, determines a deduplication unit for the data based on the access characteristics, and performs deduplication on the data based on the deduplication unit; and a storage unit that stores information being processed and information already processed by the processing unit.

Here, when an input request for the data is received, the processing unit may acquire the access time and the modification time based on time information of the input request, and may acquire the sequential access count or the random access count based on the continuity of the input request.

Here, when an output request for the data is received, the processing unit may acquire the access time based on time information of the output request, and may acquire the sequential access count or the random access count based on the continuity of the output request.

Here, when determining the deduplication unit, the processing unit may: calculate a first difference between the current access time of the data and the previous modification time of the data; when the first difference is less than or equal to a predefined first time, determine the deduplication unit to be a fourth deduplication unit, which has the lowest deduplication likelihood for the data; when the first difference exceeds the predefined first time, calculate a second difference between the current access time of the data and the previous access time of the data; when the second difference exceeds a predefined second time, determine the deduplication unit to be a first deduplication unit, which has the highest deduplication likelihood for the data; when the second difference is less than or equal to the predefined second time and the random access count for the data is greater than or equal to the sequential access count, determine the deduplication unit to be a second deduplication unit, which has a lower deduplication likelihood than the first deduplication unit; and when the second difference is less than or equal to the predefined second time and the random access count for the data is less than the sequential access count, determine the deduplication unit to be a third deduplication unit, which has a lower deduplication likelihood than the second deduplication unit.

Here, when performing deduplication on the data, the processing unit may: generate at least one data block from the data based on the deduplication unit; generate a unique identifier for the data block; determine whether the unique identifier exists in an index table; when the unique identifier exists in the index table, remove the data block corresponding to the unique identifier; and when the unique identifier does not exist in the index table, store the unique identifier together with the data block corresponding to it.

Here, when generating the unique identifier, the processing unit may use a hash algorithm to generate the unique identifier for the data block.

According to the present invention, the deduplication rate (that is, the chunk size) can be determined adaptively based on the input/output characteristics of the data, and duplicate data can be removed based on the adaptively determined deduplication rate, so low input/output latency can be provided.

FIG. 1 is a flowchart illustrating a data deduplication method according to an embodiment of the present invention.
FIG. 2 is a table showing access characteristics of data.
FIG. 3 is a flowchart illustrating the access characteristic acquisition step of the data deduplication method according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the deduplication unit determination step of the data deduplication method according to an embodiment of the present invention.
FIG. 5 is a table showing characteristics of the deduplication units.
FIG. 6 is a flowchart illustrating the deduplication execution step of the data deduplication method according to an embodiment of the present invention.
FIG. 7 is a conceptual diagram illustrating the process of generating a unique identifier for a data block.
FIG. 8 is a block diagram illustrating a data deduplication apparatus according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To aid overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

FIG. 1 is a flowchart illustrating a data deduplication method according to an embodiment of the present invention.

Referring to FIG. 1, a data deduplication method according to an embodiment of the present invention includes acquiring access characteristics of data based on an input request or an output request for the data (S100), determining a deduplication unit for the data based on the access characteristics (S200), and performing deduplication on the data based on the deduplication unit (S300).

Here, each step shown in FIG. 1 may be performed by the data deduplication apparatus shown in FIG. 8; the specific configuration and functions of the apparatus are described later.

FIG. 2 is a table showing access characteristics of data.

Referring to FIG. 2, the access characteristics of data may include an access time (a_time), a modification time (m_time), a sequential access count (seqCount), and a random access count (randCount) for the data. The access characteristics are not limited to these items and may include any information that can represent the characteristics of data input or output.

The access time (a_time) is the time at which an input request (that is, a write request) or an output request (that is, a read request) for the data was received. When an input request is received, the data deduplication apparatus may take the input request time as the access time for the data and store it (that is, record the current time in the a_time field for the data). If an access time is already stored, the apparatus may update the stored access time to the most recently acquired one.

Likewise, when an output request is received, the data deduplication apparatus may take the output request time as the access time for the data and store it (that is, record the current time in the a_time field for the data). If an access time is already stored, the apparatus may update the stored access time to the most recently acquired one.

The modification time (m_time) is the time at which an input request for the data was received. When an input request is received, the data deduplication apparatus may take the input request time as the modification time for the data and store it (that is, record the current time in the m_time field for the data). If a modification time is already stored, the apparatus may update the stored modification time to the most recently acquired one.

The sequential access count (seqCount) may mean the number of times the current data request was contiguous with the previous data request (that is, the physical or logical block numbers of the current and previous requests are consecutive, or the requests show a consecutive trend), and the random access count (randCount) may mean the number of times the current data request was not contiguous with the previous one. Here, both counts may be cumulative totals to date.

When the current data request is contiguous with the previous data request, the data deduplication apparatus may increase the value of the seqCount field for the data by 1. Otherwise, it may increase the value of the randCount field for the data by 1.
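One possible way to track the two counters is sketched below. It treats a request as contiguous when its starting block number immediately follows the last block of the previous request; the specification also allows a looser "consecutive trend" criterion that is not modeled here. The names `AccessCounters`, `record_request`, and `last_end_block`, and the choice to leave both counters unchanged on the very first request, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessCounters:
    seq_count: int = 0                     # plays the role of seqCount
    rand_count: int = 0                    # plays the role of randCount
    last_end_block: Optional[int] = None   # last block of the previous request


def record_request(counters: AccessCounters,
                   start_block: int, num_blocks: int) -> None:
    """Update the counters for a request covering num_blocks blocks."""
    if counters.last_end_block is not None:
        if start_block == counters.last_end_block + 1:
            # Block numbers continue from the previous request: sequential.
            counters.seq_count += 1
        else:
            # Block numbers do not continue: random.
            counters.rand_count += 1
    # Remember where this request ended for the next comparison.
    counters.last_end_block = start_block + num_blocks - 1


counters = AccessCounters()
record_request(counters, start_block=0, num_blocks=8)    # first request, no comparison
record_request(counters, start_block=8, num_blocks=8)    # follows block 7: sequential
record_request(counters, start_block=100, num_blocks=4)  # jump: random
```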

FIG. 3 is a flowchart illustrating the access characteristic acquisition step of the data deduplication method according to an embodiment of the present invention.

Referring to FIG. 3, acquiring the access characteristics of data (S100) may include determining whether the received request is an input request or an output request for the data (S110), acquiring the access time and the modification time based on time information of the input request when an input request is received (S120), and acquiring the sequential access count or the random access count based on the continuity of the input request (S130).

In addition, acquiring the access characteristics of data (S100) may include acquiring the access time based on time information of the output request when an output request is received (S140), and acquiring the sequential access count or the random access count based on the continuity of the output request (S150).

In step S110, the data deduplication apparatus may determine whether the received request is an input request or an output request for the data. If it is an input request, the apparatus may proceed to steps S120 and S130. If it is an output request, the apparatus may proceed to steps S140 and S150.

In step S120, the data deduplication apparatus may acquire the access time and the modification time based on time information of the input request. That is, the apparatus may take the time at which the input request was received as both the access time and the modification time, and may store both in a database. If an access time is already stored, the apparatus may update it to the acquired access time; if a modification time is already stored, the apparatus may update it to the acquired modification time.

In step S130, the data deduplication apparatus may acquire the sequential access count or the random access count based on the continuity between the input request and the previous request for the data. That is, the apparatus may increase the sequential access count by 1 if the input request is contiguous with the previous request for the data, and may increase the random access count by 1 if it is not.

For example, if the access characteristics stored in the database have a sequential access count of 7 and a random access count of 5, the data deduplication apparatus may update the sequential access count to 8 when the input request is contiguous with the previous request for the data, and may update the random access count to 6 when it is not.

Although step S120 is described here as being performed before step S130, the order of steps S120 and S130 is not limited to this. That is, step S130 may be performed at the same time as step S120, or before step S120.

In step S140, the data deduplication apparatus may acquire the access time based on time information of the output request. That is, the apparatus may take the time at which the output request was received as the access time and may store it in the database. If an access time is already stored, the apparatus may update it to the acquired access time.

In step S150, the data deduplication apparatus may acquire the sequential access count or the random access count based on the continuity between the output request and the previous request for the data. That is, the apparatus may increase the sequential access count by 1 if the output request is contiguous with the previous request for the data, and may increase the random access count by 1 if it is not.

For example, if the access characteristics stored in the database have a sequential access count of 7 and a random access count of 5, the data deduplication apparatus may update the sequential access count to 8 when the output request is contiguous with the previous request for the data, and may update the random access count to 6 when it is not.

Although step S140 is described here as being performed before step S150, the order of steps S140 and S150 is not limited to this. That is, step S150 may be performed at the same time as step S140, or before step S140.
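The flow of steps S110 to S150 can be summarized in a short sketch that uses the counts 7 and 5 from the examples in the text. The names `AccessRecord` and `acquire_access_characteristics` are hypothetical, the contiguity decision is passed in as a boolean instead of being derived from block numbers, and an in-memory object stands in for the database record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessRecord:
    a_time: Optional[float] = None  # access time
    m_time: Optional[float] = None  # modification time
    seq_count: int = 0              # sequential access count
    rand_count: int = 0             # random access count


def acquire_access_characteristics(record: AccessRecord, is_write: bool,
                                   is_sequential: bool, now: float) -> AccessRecord:
    # S110: branch on whether the request is an input (write) or output (read).
    # S120 / S140: both request types update the access time.
    record.a_time = now
    if is_write:
        # S120: only an input request also updates the modification time.
        record.m_time = now
    # S130 / S150: update exactly one counter based on request continuity.
    if is_sequential:
        record.seq_count += 1
    else:
        record.rand_count += 1
    return record


record = AccessRecord(a_time=10.0, m_time=10.0, seq_count=7, rand_count=5)
acquire_access_characteristics(record, is_write=True, is_sequential=True, now=20.0)
acquire_access_characteristics(record, is_write=False, is_sequential=False, now=30.0)
```

After the contiguous write the sequential count goes from 7 to 8, and after the non-contiguous read the random count goes from 5 to 6; the read changes `a_time` but leaves `m_time` at the time of the write.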

FIG. 4 is a flowchart illustrating the deduplication unit determination step of a data deduplication method according to an embodiment of the present invention.

Referring to FIG. 4, the step of determining the deduplication unit (S200) may include: calculating a first difference between the current access time of the data and the previous modification time of the data (S210); determining whether the first difference exceeds a predefined first time (S220); when the first difference is less than or equal to the predefined first time, determining the deduplication unit to be a fourth deduplication unit, which has the lowest deduplication likelihood for the data (S230); when the first difference exceeds the predefined first time, calculating a second difference between the current access time of the data and the previous access time of the data (S240); determining whether the second difference is less than or equal to a predefined second time (S250); when the second difference exceeds the predefined second time, determining the deduplication unit to be a first deduplication unit, which has the highest deduplication likelihood for the data (S260); when the second difference is less than or equal to the predefined second time, determining whether the random access count of the data is less than the sequential access count of the data (S270); when the random access count is greater than or equal to the sequential access count, determining the deduplication unit to be a second deduplication unit, which has a lower deduplication likelihood than the first deduplication unit (S280); and when the random access count is less than the sequential access count, determining the deduplication unit to be a third deduplication unit, which has a lower deduplication likelihood than the second deduplication unit (S290).

Hereinafter, the deduplication unit determined in step S200 will be described in detail with reference to FIG. 5.

FIG. 5 is a table showing the characteristics of the deduplication units.

Referring to FIG. 5, deduplication units may be classified into a first deduplication unit, a second deduplication unit, a third deduplication unit, and a fourth deduplication unit. The first deduplication unit may be applied to infrequently used data, where infrequently used data may mean data for which the difference between the current access time and the previous access time is greater than a predefined threshold. The first deduplication unit uses the smallest chunk among all deduplication units and can therefore provide the highest data deduplication rate among them. That is, the first deduplication unit may be used to provide a high data deduplication rate rather than low latency.

The second deduplication unit may be applied to data for which random access occurs more often than sequential access. The second deduplication unit uses chunks that are larger than those of the first deduplication unit and smaller than those of the third deduplication unit, and can therefore provide a relatively high data deduplication rate among all deduplication units (i.e., lower than that of the first deduplication unit and higher than that of the third deduplication unit). That is, for data that is frequently accessed randomly, random access performance is not impaired even if physical distortion of the data layout occurs, so the second deduplication unit may use relatively small chunks to provide a high deduplication rate.

The third deduplication unit may be applied to data for which sequential access occurs more often than random access. The third deduplication unit uses chunks larger than those of the second deduplication unit and can therefore provide a relatively low deduplication rate among all deduplication units (i.e., lower than that of the second deduplication unit). That is, the third deduplication unit may use relatively large chunks to provide low input/output latency for data that is frequently accessed sequentially.

The fourth deduplication unit may be applied to data for which input occurs frequently. The fourth deduplication unit may use chunks larger than those of the third deduplication unit and can therefore provide the lowest deduplication rate among all deduplication units (i.e., lower than that of the third deduplication unit). Alternatively, the fourth deduplication unit may mean that no deduplication is performed on the data. That is, because data deduplication benefits output-oriented data, deduplication may be skipped for input-oriented data.

The classification of deduplication units is not limited to the above description and may be configured in various ways. For example, the deduplication units may be organized into three classes or five classes. When the deduplication units are organized into three classes, the first deduplication unit may provide the highest deduplication rate, the second deduplication unit may provide a lower deduplication rate than the first, and the third deduplication unit may provide a lower deduplication rate than the second (i.e., the lowest deduplication rate).

Referring again to FIG. 4, in step S210, the data deduplication apparatus may calculate a first difference between the current access time of the data and the previous modification time of the data. That is, the apparatus may calculate the first difference as the difference between the current access time, obtained from the data input request or output request, and the previous modification time of the same data (i.e., the modification time obtained from the previous input request for the data). Here, the first difference may indicate how often the data is modified (i.e., how often input requests for the data occur).

In step S220, the data deduplication apparatus may determine whether the first difference exceeds the predefined first time. Here, the predefined first time is the reference time for deciding whether to perform data deduplication and may take different values depending on the user's settings. For example, the predefined first time may be set to 1 hour, 2 hours, 3 hours, and so on. When the first difference is less than or equal to the predefined first time, the apparatus may proceed to step S230; when the first difference exceeds the predefined first time, the apparatus may proceed to step S240.

In step S230, the data deduplication apparatus may determine the deduplication unit to be the fourth deduplication unit, which has the lowest deduplication likelihood for the data. That is, a first difference that is less than or equal to the predefined first time indicates data for which input occurs frequently, so the apparatus may select the fourth deduplication unit, which has the lowest data deduplication rate among the deduplication units (or performs no data deduplication).

In step S240, the data deduplication apparatus may calculate a second difference between the current access time of the data and the previous access time of the data. That is, the apparatus may calculate the second difference as the difference between the current access time, obtained from the data input request or output request, and the previous access time of the same data (i.e., the access time obtained from the previous input request or previous output request for the data). Here, the second difference may indicate how often the data is accessed.

In step S250, the data deduplication apparatus may determine whether the second difference is less than or equal to the predefined second time. Here, the predefined second time is the reference time for identifying data that is unlikely to be accessed and may take different values depending on the user's settings. For example, the predefined second time may be set to 1 hour, 2 hours, 3 hours, and so on. When the second difference exceeds the predefined second time, the apparatus may proceed to step S260; when the second difference is less than or equal to the predefined second time, the apparatus may proceed to step S270.

In step S260, the data deduplication apparatus may determine the deduplication unit to be the first deduplication unit, which has the highest deduplication likelihood for the data. That is, a second difference that exceeds the predefined second time indicates infrequently used data (i.e., data that is unlikely to be accessed), so the apparatus may select the first deduplication unit, which has the highest data deduplication rate among the deduplication units.

In step S270, the data deduplication apparatus may determine whether the random access count of the data is less than the sequential access count of the data. When the random access count is greater than or equal to the sequential access count, the apparatus may proceed to step S280; when the random access count is less than the sequential access count, the apparatus may proceed to step S290.

In step S280, the data deduplication apparatus may determine the deduplication unit to be the second deduplication unit, which has a lower deduplication likelihood than the first deduplication unit. That is, a random access count that is greater than or equal to the sequential access count indicates data that is frequently accessed randomly, so the apparatus may select the second deduplication unit, which has a relatively high data deduplication rate among the deduplication units.

In step S290, the data deduplication apparatus may determine the deduplication unit to be the third deduplication unit, which has a lower deduplication likelihood than the second deduplication unit. That is, a random access count that is less than the sequential access count indicates data that is frequently accessed sequentially, so the apparatus may select the third deduplication unit, which has a relatively low data deduplication rate among the deduplication units.
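
Steps S210 to S290 amount to a small decision tree, which can be sketched as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the names `DedupUnit` and `select_dedup_unit` are hypothetical, and all times are assumed to be expressed in seconds.

```python
from enum import IntEnum


class DedupUnit(IntEnum):
    """Hypothetical labels for the four deduplication units of FIG. 5."""
    FIRST = 1   # smallest chunks, highest deduplication rate
    SECOND = 2
    THIRD = 3
    FOURTH = 4  # largest chunks (or no deduplication at all)


def select_dedup_unit(now, m_time, a_time, seq_count, rand_count,
                      first_time, second_time):
    """Pick a deduplication unit from the access characteristics.

    now:    current access time; m_time / a_time: previous modification
    and access times; first_time / second_time: user-defined thresholds.
    """
    first_diff = now - m_time              # S210
    if first_diff <= first_time:           # S220
        return DedupUnit.FOURTH            # S230: frequently written data
    second_diff = now - a_time             # S240
    if second_diff > second_time:          # S250
        return DedupUnit.FIRST             # S260: rarely accessed data
    if rand_count >= seq_count:            # S270
        return DedupUnit.SECOND            # S280: mostly random access
    return DedupUnit.THIRD                 # S290: mostly sequential access


HOUR = 3600  # both thresholds set to 1 hour, one of the values the text mentions
print(select_dedup_unit(10_000, 9_000, 9_500, 7, 5, HOUR, HOUR).name)  # FOURTH
print(select_dedup_unit(10_000, 0, 1_000, 7, 5, HOUR, HOUR).name)      # FIRST
print(select_dedup_unit(10_000, 0, 9_000, 5, 7, HOUR, HOUR).name)      # SECOND
print(select_dedup_unit(10_000, 0, 9_000, 7, 5, HOUR, HOUR).name)      # THIRD
```

Because the write-frequency check comes first, recently modified data is classified as the fourth unit regardless of its access counts.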

FIG. 6 is a flowchart illustrating the deduplication performing step of a data deduplication method according to an embodiment of the present invention, and FIG. 7 is a conceptual diagram illustrating the process of generating a unique identifier for a data block.

Hereinafter, the step of performing deduplication (S300) will be described in detail with reference to FIGS. 6 and 7.

In a data deduplication method according to an embodiment of the present invention, the step of performing deduplication (S300) may include: generating at least one data block for the data based on the deduplication unit (S310); generating a unique identifier for the data block (S320); determining whether the unique identifier exists in an index table (S330); when the unique identifier does not exist in the index table, storing the unique identifier and the data block corresponding to the unique identifier (S340); and when the unique identifier exists in the index table, removing the data block corresponding to the unique identifier (S350).

In step S310, the data deduplication apparatus may generate at least one data block for the data based on the deduplication unit, where the deduplication unit denotes the chunk size. That is, the apparatus may generate data blocks based on the chunk size corresponding to whichever of the first, second, third, or fourth deduplication unit has been determined. When the fourth deduplication unit means that no data deduplication is performed, the apparatus may not generate data blocks for the data.

Based on step S310 described above, the data deduplication apparatus may generate a plurality of data blocks (31, FIG. 7) from the data (30, FIG. 7).
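
Step S310 can be sketched as follows. The text states only that each deduplication unit corresponds to a chunk size that grows from the first unit to the fourth; the concrete sizes in `CHUNK_SIZES`, the use of fixed-size chunking, and the names `make_blocks` and `dedup_fourth` are assumptions made for illustration.

```python
from __future__ import annotations

# Hypothetical chunk sizes in bytes; only the ordering
# (first < second < third < fourth) is taken from the text.
CHUNK_SIZES = {1: 4096, 2: 8192, 3: 16384, 4: 32768}


def make_blocks(data: bytes, unit: int, dedup_fourth: bool = True) -> list[bytes]:
    """Split `data` into blocks of the chunk size assigned to `unit` (S310).

    If the fourth unit is configured to mean "no deduplication"
    (dedup_fourth=False), no data blocks are generated for it.
    """
    if unit == 4 and not dedup_fourth:
        return []
    size = CHUNK_SIZES[unit]
    # Fixed-size chunking (an assumption); the last block may be shorter.
    return [data[i:i + size] for i in range(0, len(data), size)]


data = bytes(10_000)
print([len(b) for b in make_blocks(data, unit=1)])   # [4096, 4096, 1808]
print([len(b) for b in make_blocks(data, unit=3)])   # [10000]
print(make_blocks(data, unit=4, dedup_fourth=False)) # []
```

Smaller chunks produce more blocks per piece of data, which raises the chance that a block matches one already stored but also increases the number of identifiers to compute and look up.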

In step S320, the data deduplication apparatus may generate a unique identifier for the data block. Here, the apparatus may use a hash algorithm (e.g., SHA-1, SHA-2, SHA-3) to generate the unique identifier for the data block. The method of generating the unique identifier is not limited to the above description, and various known methods may be used.

Based on step S320 described above, the data deduplication apparatus may generate a unique identifier (32, FIG. 7) from each data block (31, FIG. 7).

In step S330, the data deduplication apparatus may determine whether the unique identifier exists in the index table. The index table may contain unique identifiers and the data blocks corresponding to them. If the unique identifier exists in the index table, the data block corresponding to the unique identifier is already stored; if the unique identifier does not exist in the index table, the corresponding data block is not yet stored. When the unique identifier does not exist in the index table, the apparatus may proceed to step S340; when the unique identifier exists in the index table, the apparatus may proceed to step S350.

In step S340, the data deduplication apparatus may store the unique identifier and the data block corresponding to the unique identifier. That is, because the data block corresponding to the unique identifier is not yet stored, the apparatus may store the unique identifier and the data block in the database (or the index table) without performing deduplication.

In step S350, the data deduplication apparatus may remove the data block corresponding to the unique identifier. That is, because the data block corresponding to the unique identifier is already stored, the apparatus may perform deduplication (i.e., delete the data block corresponding to the unique identifier).
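
Steps S320 to S350 can be sketched with Python's standard `hashlib` module. This is a minimal sketch, not the disclosed implementation: the index table is modeled as an in-memory dictionary, SHA-256 (a member of the SHA-2 family) stands in for whichever hash algorithm is chosen, and the returned list of identifiers, needed to reassemble the data later, is an assumption that the text does not describe.

```python
import hashlib


def deduplicate(blocks, index_table):
    """Store only blocks whose identifier is not yet in `index_table`.

    `index_table` maps identifier -> data block (modeled as a dict).
    Returns the identifiers of all blocks in order (an assumed "recipe").
    """
    recipe = []
    for block in blocks:
        ident = hashlib.sha256(block).hexdigest()  # S320: unique identifier
        if ident not in index_table:               # S330: look up the index table
            index_table[ident] = block             # S340: store identifier and block
        # S350: otherwise the block is a duplicate and is not stored again
        recipe.append(ident)
    return recipe


index = {}
recipe = deduplicate([b"aaaa", b"bbbb", b"aaaa"], index)
print(len(recipe), len(index))  # 3 2
```

In the example, three blocks are processed but only two are stored, because the first and third blocks hash to the same identifier.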

The data deduplication methods according to an embodiment of the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software.

Examples of computer-readable media include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as at least one software module in order to perform the operations of the present invention, and vice versa.

FIG. 8 is a block diagram illustrating a data deduplication apparatus according to an embodiment of the present invention.

Referring to FIG. 8, a data deduplication apparatus according to an embodiment of the present invention includes a processing unit (10) and a storage unit (20).

The processing unit (10) may obtain the access characteristics of data based on a data input request or output request, determine the deduplication unit of the data based on the access characteristics, and perform deduplication on the data based on the deduplication unit.

Here, the processing unit (10) may include, as logical components, an access characteristic acquisition unit (11), a deduplication unit determination unit (12), a deduplication performing unit (13), and an index table management unit (14). The processing unit (10) may also include, as physical components, a processor and a memory. The processor may be a general-purpose processor (e.g., a central processing unit (CPU) and/or a graphics processing unit (GPU)) or a processor dedicated to performing the data deduplication method. The memory may store program code for performing the data deduplication method. That is, the processor may read the program code stored in the memory and perform each step of the data deduplication method based on the program code.

Here, the access characteristics of data may include the access time of the data (a_time, see FIG. 2), the modification time of the data (m_time, see FIG. 2), the sequential access count of the data (seqCount, see FIG. 2), the random access count of the data (randCount, see FIG. 2), and the like. The access characteristics are not limited to this information and may include any information that can represent the characteristics of data input or output.

The access time of data (a_time) is the time at which a data input request (i.e., a data write request) or a data output request (i.e., a data read request) was received. The modification time of data (m_time) is the time at which a data input request was received. The sequential access count of data (seqCount) is the number of times the current data request was consecutive with the previous data request (i.e., the current data request and the previous data request were the same), and the random access count of data (randCount) is the number of times the current data request was not consecutive with the previous data request (i.e., the current data request and the previous data request were not the same).

When obtaining the access characteristics, the processing unit (10) may obtain the access time and the modification time based on the time information of a data input request, and may obtain the sequential access count or the random access count based on the continuity of the data input request. The processing unit (10) may also obtain the access time based on the time information of a data output request, and may obtain the sequential access count or the random access count based on the continuity of the data output request. The process of obtaining the access characteristics may be performed by the access characteristic acquisition unit (11) in the processing unit (10).

Specifically, the processing unit (10) may determine whether a received request is a data input request or a data output request. When the received request is a data input request, the processing unit (10) may take the time at which the input request was received as both the access time and the modification time, and may store the obtained access time and modification time in the storage unit (20). In addition, the processing unit (10) may increase the sequential access count by 1 when the input request is consecutive with the previous request for the data, and may increase the random access count by 1 when the input request is not consecutive with the previous request for the data.

When the received request is a data output request, the processing unit (10) may take the time at which the output request was received as the access time and store the obtained access time in the database. In addition, the processing unit (10) may increase the sequential access count by 1 when the output request is consecutive with the previous request for the data, and may increase the random access count by 1 when the output request is not consecutive with the previous request for the data.
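
The handling of input and output requests described above can be sketched as a single update routine. This is an illustrative sketch only: the `AccessProfile` record and the `on_request` function are hypothetical names, the field names follow a_time, m_time, seqCount, and randCount from FIG. 2, and continuity detection is again reduced to a caller-supplied flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessProfile:
    """Per-data access characteristics (hypothetical record layout)."""
    a_time: Optional[float] = None  # last access time (input or output)
    m_time: Optional[float] = None  # last modification time (input only)
    seq_count: int = 0              # seqCount
    rand_count: int = 0             # randCount


def on_request(profile: AccessProfile, is_input: bool,
               received_at: float, is_consecutive: bool) -> AccessProfile:
    """Update the profile for one received request."""
    profile.a_time = received_at          # every request updates the access time
    if is_input:
        profile.m_time = received_at      # only input (write) requests update m_time
    if is_consecutive:
        profile.seq_count += 1
    else:
        profile.rand_count += 1
    return profile


p = AccessProfile()
on_request(p, is_input=True, received_at=100.0, is_consecutive=False)
on_request(p, is_input=False, received_at=250.0, is_consecutive=True)
print(p)
# AccessProfile(a_time=250.0, m_time=100.0, seq_count=1, rand_count=1)
```

After the write at time 100 and the read at time 250, the access time reflects the read while the modification time still reflects the write, which is what lets the first and second differences diverge.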

When determining the deduplication unit, the processing unit (10) may calculate a first difference between the current access time of the data and the previous modification time of the data, and when the first difference is less than or equal to the predefined first time, may determine the deduplication unit to be the fourth deduplication unit, which has the lowest deduplication likelihood for the data. When the first difference exceeds the predefined first time, the processing unit (10) may calculate a second difference between the current access time of the data and the previous access time of the data.

Here, when the second difference exceeds the predefined second time, the processing unit (10) may determine the deduplication unit to be the first deduplication unit, which has the highest deduplication likelihood for the data; when the second difference is less than or equal to the predefined second time, it may determine whether the random access count of the data is greater than or equal to the sequential access count.

Here, when the random access count of the data is greater than or equal to the sequential access count, the processing unit (10) may determine the deduplication unit to be the second deduplication unit, which has a lower deduplication likelihood than the first deduplication unit; when the random access count is less than the sequential access count, it may determine the deduplication unit to be the third deduplication unit, which has a lower deduplication likelihood than the second deduplication unit.

The process of determining the deduplication unit described above may be performed by the deduplication unit determination unit (12) in the processing unit (10).

Here, deduplication units may be classified into a first deduplication unit, a second deduplication unit, a third deduplication unit, and a fourth deduplication unit. The first deduplication unit may be applied to infrequently used data, the second deduplication unit to data for which random access occurs more often than sequential access, the third deduplication unit to data for which sequential access occurs more often than random access, and the fourth deduplication unit to data for which input occurs frequently.

Specifically, the processing unit 10 may calculate a first difference, which is the difference between the current access time obtained from the input request or output request for the data and the previous change time for the same data (that is, the change time obtained from the previous input request for the data). Here, the first difference may indicate how often the data is changed (that is, how often input requests for the data occur).

The processing unit 10 may determine whether the first difference exceeds a predefined first time. Here, the predefined first time is the reference time for deciding whether to perform data deduplication, and may take different values depending on the user's settings.

When the first difference is less than or equal to the predefined first time, the data is data for which input occurs frequently, so the processing unit 10 may select the fourth deduplication unit, which has the lowest data deduplication rate among the deduplication units (or which performs no data deduplication).

On the other hand, when the first difference exceeds the predefined first time, the processing unit 10 may calculate a second difference, which is the difference between the current access time obtained from the input request or output request for the data and the previous access time for the same data (that is, the access time obtained from the previous input request or previous output request for the data). Here, the second difference may indicate how often the data is accessed.

The processing unit 10 may determine whether the second difference is less than or equal to a predefined second time. Here, the predefined second time is the reference time for identifying data that is unlikely to be accessed, and may take different values depending on the user's settings.

When the second difference exceeds the predefined second time, the data is data that is not frequently used (that is, data that is unlikely to be accessed), so the processing unit 10 may select the first deduplication unit, which has the highest data deduplication rate among the deduplication units. On the other hand, when the second difference is less than or equal to the predefined second time, the processing unit 10 may determine whether the number of random accesses to the data is less than the number of sequential accesses to the data.

When the number of random accesses to the data is greater than or equal to the number of sequential accesses, the data is data for which random access occurs frequently, so the processing unit 10 may select the second deduplication unit, which has a relatively high data deduplication rate among the deduplication units. On the other hand, when the number of random accesses to the data is less than the number of sequential accesses, the data is data for which sequential access occurs frequently, so the processing unit 10 may select the third deduplication unit, which has a relatively low data deduplication rate among the deduplication units.
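
The selection procedure described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the apparatus: it assumes that all times are expressed in seconds, and the names DedupUnit, AccessInfo, and select_unit, as well as any threshold values passed in, are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from enum import Enum


class DedupUnit(Enum):
    FIRST = 1   # rarely used data: highest deduplication rate
    SECOND = 2  # random access at least as frequent as sequential access
    THIRD = 3   # sequential access more frequent than random access
    FOURTH = 4  # frequently written data: lowest rate, or no deduplication


@dataclass
class AccessInfo:
    current_access: float   # time of the current input/output request
    previous_change: float  # time of the previous input (write) request
    previous_access: float  # time of the previous input or output request
    random_count: int       # number of random accesses so far
    sequential_count: int   # number of sequential accesses so far


def select_unit(info: AccessInfo, first_time: float, second_time: float) -> DedupUnit:
    # First difference: how recently the data was changed.
    first_diff = info.current_access - info.previous_change
    if first_diff <= first_time:
        return DedupUnit.FOURTH

    # Second difference: how recently the data was accessed at all.
    second_diff = info.current_access - info.previous_access
    if second_diff > second_time:
        return DedupUnit.FIRST

    # Recently accessed data: compare random and sequential access counts.
    if info.random_count >= info.sequential_count:
        return DedupUnit.SECOND
    return DedupUnit.THIRD
```

For example, with a hypothetical first time of 10 seconds and second time of 60 seconds, data written 5 seconds ago is assigned the fourth deduplication unit regardless of its access counts, because the first comparison is made before any other.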

When performing data deduplication, the processing unit 10 may generate at least one data block for the data based on the deduplication unit, generate a unique identifier for the data block, and determine whether the unique identifier exists in the index table. When the unique identifier exists in the index table, the processing unit 10 may remove the data block corresponding to the unique identifier; when the unique identifier does not exist in the index table, the processing unit 10 may store the unique identifier and the data block corresponding to it.

Here, the process of removing duplicate data may be performed by the deduplication performing unit 13 in the processing unit 10, and the process of storing, deleting, and updating information in the index table may be performed by the index table management unit 14 in the processing unit 10.

Specifically, the processing unit 10 may generate data blocks based on the chunk size corresponding to the first deduplication unit, the chunk size corresponding to the second deduplication unit, the chunk size corresponding to the third deduplication unit, or the chunk size corresponding to the fourth deduplication unit. On the other hand, when the fourth deduplication unit means that data deduplication is not performed, the processing unit 10 may not generate data blocks for the data.
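
As an illustration of generating data blocks from a chunk size, the following sketch splits data into fixed-size blocks. The chunk sizes in CHUNK_SIZE are hypothetical values chosen only for this example (this passage does not state concrete sizes), the fixed-size splitting is one simple choice of chunking, and the function name make_blocks is likewise an assumption. The fourth deduplication unit is modeled here as "no deduplication", which is one of the two options the description allows.

```python
from typing import Dict, List, Optional

# Hypothetical chunk sizes in bytes, keyed by deduplication unit number.
# None models the case where the fourth unit performs no deduplication.
CHUNK_SIZE: Dict[int, Optional[int]] = {1: 4096, 2: 8192, 3: 65536, 4: None}


def make_blocks(data: bytes, unit: int) -> List[bytes]:
    """Split data into fixed-size blocks for the given deduplication unit."""
    size = CHUNK_SIZE[unit]
    if size is None:
        # No deduplication: no data blocks are generated.
        return []
    # The last block may be shorter than the chunk size.
    return [data[i:i + size] for i in range(0, len(data), size)]
```

Smaller chunks generally expose more duplicate blocks at the cost of more identifiers to compute and look up, which is why this sketch gives the unit with the highest deduplication rate the smallest size.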

The processing unit 10 may generate a unique identifier for the data block using a hash algorithm (for example, SHA-1, SHA-2, or SHA-3). The method of generating the unique identifier for the data block is not limited to the above description, and various known methods may be used to generate it.
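
A minimal sketch of identifier generation using Python's standard hashlib module is shown below. The function name block_identifier and the default choice of SHA-256 (a member of the SHA-2 family) are assumptions made for this example; the description leaves the algorithm open.

```python
import hashlib


def block_identifier(block: bytes, algorithm: str = "sha256") -> str:
    """Return a hex digest that serves as the identifier of a data block."""
    # hashlib.new accepts names such as "sha1", "sha256", and "sha3_256".
    return hashlib.new(algorithm, block).hexdigest()
```

Identical blocks always produce the same identifier, so comparing identifiers stands in for comparing block contents.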

The processing unit 10 may determine whether the unique identifier exists in the index table. When the unique identifier does not exist in the index table, the data block corresponding to the unique identifier has not yet been stored, so the processing unit 10 may store the unique identifier and the data block in the storage unit 20 without performing deduplication. On the other hand, when the unique identifier exists in the index table, the data block corresponding to the unique identifier has already been stored, so the processing unit 10 may perform deduplication (that is, delete the data block corresponding to the unique identifier).
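
The index table check can be sketched as follows. This is a simplified in-memory illustration, assuming the index table is a dictionary that maps identifiers to stored blocks; the class name DedupStore, the method write_block, and the use of SHA-256 are hypothetical choices for this example.

```python
import hashlib
from typing import Dict, Tuple


class DedupStore:
    """In-memory stand-in for the index table and block storage."""

    def __init__(self) -> None:
        # Maps each unique identifier to the single stored copy of its block.
        self.index: Dict[str, bytes] = {}

    def write_block(self, block: bytes) -> Tuple[str, bool]:
        """Store a block unless an identical one exists.

        Returns the identifier and True if the block was a duplicate.
        """
        identifier = hashlib.sha256(block).hexdigest()
        if identifier in self.index:
            # Already stored: the incoming duplicate block is dropped.
            return identifier, True
        # Not yet stored: keep both the identifier and the block.
        self.index[identifier] = block
        return identifier, False
```

Writing the same block twice leaves a single entry in the index, and the second call reports the block as a duplicate.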

The storage unit 20 may store information being processed and information already processed by the processing unit 10. For example, the storage unit 20 may store the input request for the data, the output request for the data, the access characteristic for the data, the first difference, the predefined first time, the second difference, the predefined second time, the index table, deduplication unit information, and the like.

Although the invention has been described with reference to the above embodiments, those skilled in the art will understand that various modifications and changes may be made without departing from the spirit and scope of the invention as set forth in the claims below.

10: Processing unit
11: Access characteristic acquisition unit
12: Deduplication unit determination unit
13: Deduplication performing unit
14: Index table management unit
20: Storage unit

Claims

A data deduplication method performed in a data deduplication apparatus, the method comprising:
obtaining an access characteristic for data based on an input request or an output request for the data;
determining a deduplication unit for the data based on the access characteristic; and
generating a data block for the data based on the deduplication unit, generating a unique identifier for the data block, and performing deduplication on the data based on whether the unique identifier exists in an index table.

The method according to claim 1,
wherein the access characteristic includes at least one of an access time for the data, a change time for the data, a sequential access count for the data, and a random access count for the data.

The method of claim 2,
wherein obtaining the access characteristic comprises:
when the input request for the data is received, acquiring the access time and the change time based on time information of the input request; and
acquiring the sequential access count or the random access count based on the continuity of the input request for the data.

The method of claim 2,
wherein obtaining the access characteristic comprises:
when the output request for the data is received, acquiring the access time based on time information of the output request; and
acquiring the sequential access count or the random access count based on the continuity of the output request for the data.

The method of claim 2,
wherein determining the deduplication unit comprises:
calculating a first difference between a current access time for the data and a previous change time for the data;
when the first difference is less than or equal to a predefined first time, determining the deduplication unit to be a fourth deduplication unit having the lowest likelihood of deduplication for the data;
when the first difference exceeds the predefined first time, calculating a second difference between the current access time for the data and a previous access time for the data;
when the second difference exceeds a predefined second time, determining the deduplication unit to be a first deduplication unit having the highest likelihood of deduplication for the data;
when the second difference is less than or equal to the predefined second time and the random access count for the data is greater than or equal to the sequential access count, determining the deduplication unit to be a second deduplication unit having a lower likelihood of deduplication than the first deduplication unit; and
when the second difference is less than or equal to the predefined second time and the random access count for the data is less than the sequential access count, determining the deduplication unit to be a third deduplication unit having a lower likelihood of deduplication than the second deduplication unit.

The method of claim 5,
wherein the fourth deduplication unit indicates that deduplication is not performed on the data.

The method according to claim 1,
wherein performing deduplication on the data comprises:
determining whether the unique identifier exists in the index table;
when the unique identifier exists in the index table, removing the data block corresponding to the unique identifier; and
when the unique identifier does not exist in the index table, storing the unique identifier and the data block corresponding to the unique identifier.

The method of claim 7,
wherein generating the unique identifier comprises generating the unique identifier for the data block using a hash algorithm.

The method according to claim 1,
wherein the deduplication unit is classified into at least one deduplication unit based on the likelihood of deduplication of the data.

A data deduplication apparatus comprising:
a processing unit that obtains an access characteristic for data based on an input request or an output request for the data, determines a deduplication unit for the data based on the access characteristic, generates a data block for the data based on the deduplication unit, generates a unique identifier for the data block, and performs deduplication on the data based on whether the unique identifier exists in an index table; and
a storage unit that stores information being processed and information processed by the processing unit.

The apparatus of claim 10,
wherein the access characteristic includes at least one of an access time for the data, a change time for the data, a sequential access count for the data, and a random access count for the data.

The apparatus of claim 11,
wherein the processing unit, when the input request for the data is received, acquires the access time and the change time based on time information of the input request, and acquires the sequential access count or the random access count based on the continuity of the input request for the data.

The apparatus of claim 11,
wherein the processing unit, when the output request for the data is received, acquires the access time based on time information of the output request, and acquires the sequential access count or the random access count based on the continuity of the output request for the data.

The apparatus of claim 11,
wherein the processing unit:
calculates a first difference between a current access time for the data and a previous change time for the data;
when the first difference is less than or equal to a predefined first time, determines the deduplication unit to be a fourth deduplication unit having the lowest likelihood of deduplication for the data;
when the first difference exceeds the predefined first time, calculates a second difference between the current access time for the data and a previous access time for the data;
when the second difference exceeds a predefined second time, determines the deduplication unit to be a first deduplication unit having the highest likelihood of deduplication for the data;
when the second difference is less than or equal to the predefined second time and the random access count for the data is greater than or equal to the sequential access count, determines the deduplication unit to be a second deduplication unit having a lower likelihood of deduplication than the first deduplication unit; and
when the second difference is less than or equal to the predefined second time and the random access count for the data is less than the sequential access count, determines the deduplication unit to be a third deduplication unit having a lower likelihood of deduplication than the second deduplication unit.

The apparatus of claim 14,
wherein the fourth deduplication unit indicates that deduplication is not performed on the data.

The apparatus of claim 10,
wherein the processing unit determines whether the unique identifier exists in the index table, removes the data block corresponding to the unique identifier when the unique identifier exists in the index table, and stores the unique identifier and the data block corresponding to the unique identifier when the unique identifier does not exist in the index table.

The apparatus of claim 16,
wherein the processing unit generates the unique identifier for the data block using a hash algorithm.

The apparatus of claim 10,
wherein the deduplication unit is classified into at least one deduplication unit based on the likelihood of deduplication of the data.