KR101652436B1

KR101652436B1 - Apparatus for data de-duplication in a distributed file system and method thereof

Info

Publication number: KR101652436B1
Application number: KR1020100079160A
Authority: KR
Inventors: 김승민
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2010-08-17
Filing date: 2010-08-17
Publication date: 2016-08-30
Also published as: KR20120016747A

Abstract

The present invention relates to a deduplication technique in a distributed file system. In a distributed file system that creates and manages replicas of each file across multiple storage spaces to improve data stability and performance, replicas of files created by different users are organized into similar file groups, each group is stored on the file server (that is, the storage medium) assigned to it, and deduplication is then performed per storage medium. According to the present invention, files in the distributed file system that are found to be similar through analysis of their extensions, content types, and the like are organized into similar groups and stored on the storage medium assigned to each group, and deduplication is performed per storage medium. This makes it possible to deduplicate identical files that have different names, and thereby to save storage space on the storage media efficiently.

Description

APPARATUS FOR DATA DE-DUPLICATION IN A DISTRIBUTED FILE SYSTEM AND METHOD THEREOF

The present invention relates to a deduplication apparatus and method in a distributed file system. More particularly, it relates to a deduplication apparatus and method suited to a distributed file system that creates and manages replicas of each file across multiple storage spaces to improve data stability and performance, in which replicas of files created by different users are organized into similar file groups, each group is stored on the disk assigned to it, and deduplication is then performed per disk.

In general, a distributed file system built on commodity hardware creates and manages replicas of each file by distributing them across multiple storage devices in order to improve data stability and performance.

Such distributed file systems include symmetric structures, in which replicas are distributed across storage devices of equal capacity, and asymmetric structures, in which replicas are stored on storage devices of varying capacity and on various kinds of storage media.

In particular, a commodity-hardware-based distributed file system keeps at least one replica of each original file on a disk or computer other than the one holding the original, in order to mitigate the data instability and performance degradation that can result from using relatively inexpensive disk media. Deduplication is the technique used to curb the resulting growth in disk space usage.

Deduplication is applied to blocks, the constituent units of a disk or file. If two blocks are identical, even when they belong to different files, only one copy of the block is stored and link information pointing to it is maintained, which reduces the data storage space required.
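As a rough illustration of the block-level technique described above, the following Python sketch stores each distinct block once and keeps per-file link information. The 4-byte block size, the SHA-256 fingerprint, and the names `BlockStore`, `put`, and `get` are assumptions made for this example; the specification does not prescribe any of them.

```python
import hashlib

BLOCK_SIZE = 4  # assumed tiny block size so the example is easy to follow


class BlockStore:
    """Keeps one copy of each distinct block plus per-file link information."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (each block stored once)
        self.links = {}   # file name -> ordered list of block fingerprints

    def put(self, name, data):
        fingerprints = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fp, block)  # do nothing if already stored
            fingerprints.append(fp)
        self.links[name] = fingerprints

    def get(self, name):
        # Rebuild the file by following its links to the stored blocks.
        return b"".join(self.blocks[fp] for fp in self.links[name])


store = BlockStore()
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAABBBBDDDD")  # shares its first two blocks with a.txt
```

The two files contain six blocks in total, but the store holds only the distinct ones, and both files can still be reconstructed from their links.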

For example, an attachment that a mail program sends to many users could be stored separately on each user's computer. With deduplication, a single copy of the file is stored on the disk or computer and, for instance, 100 pieces of link information pointing to that single file are maintained, which saves storage capacity effectively.
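The saving in the mail attachment example can be worked through with a small calculation. The attachment size (5,000,000 bytes) and the size of one piece of link information (64 bytes) are hypothetical values chosen only for illustration, as is the function name `storage_bytes`; only the figure of 100 links comes from the text.

```python
def storage_bytes(recipients, attachment_size, link_size, deduplicated):
    """Bytes needed to keep one attachment available to every recipient."""
    if deduplicated:
        # One stored copy plus one link per recipient.
        return attachment_size + recipients * link_size
    # One full copy per recipient.
    return recipients * attachment_size


# Hypothetical sizes: a 5,000,000-byte attachment and 64-byte links.
without_dedup = storage_bytes(100, 5_000_000, 64, deduplicated=False)
with_dedup = storage_bytes(100, 5_000_000, 64, deduplicated=True)
```

Under these assumed sizes, the deduplicated layout needs roughly 1% of the space of the layout that stores 100 full copies.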

In conventional deduplication techniques that operate as described above, unnecessary files are removed by searching all of the interlinked disk devices for files having the same file name and extension, or duplicates are identified by comparing files block by block. However, the approach based on identical file names and extensions treats files as non-duplicates whenever their names or extensions differ, and block-by-block comparison overloads the system and cannot deduplicate quickly as the number of stored files grows.

Accordingly, the present invention provides a deduplication apparatus and method for a distributed file system that creates and manages replicas of each file across multiple storage spaces to improve data stability and performance, in which replicas of files created by different users are organized into similar file groups, each group is stored on the file server assigned to it, and deduplication is then performed per file server.

The present invention also provides a deduplication apparatus and method for a distributed file system in which similar files are found by checking the content type of each file, the similar files found are organized into groups and stored on the storage medium assigned to each group, and deduplication is performed per storage medium.

An apparatus according to an embodiment of the present invention includes: a duplicate file checking unit that determines whether files stored on a plurality of file servers are similar based on their type information; a similar file group construction unit that forms the files determined to be similar into groups and assigns each similar file group to a file server; and a deduplication unit that removes duplicate files within each similar file group based on the similarity determined by the duplicate file checking unit.

The type information is at least one of a file name, a file extension, a file size, a file type, and metadata information.

The duplicate file checking unit determines files to be similar in any one of the following cases: the file names and file sizes are the same; the file names differ but the file sizes are the same; or the file size or file name differs but at least one block is identical.

The similar file group construction unit forms groups according to at least one of the following criteria: file extension, file type, file importance, and file usage frequency.

The similar file group construction unit assigns at least one similar file group to each file server and stores it there.

The deduplication unit removes, from among the duplicate files in each similar file group, the files in excess of a preset number.

When at least one block is identical among the similar files in a similar file group, the deduplication unit removes the identical blocks in excess of a preset number.

A method according to an embodiment of the present invention includes: determining, by a duplicate file checking unit, whether files distributed across and stored on a plurality of file servers are similar based on their type information; forming, by a similar file group construction unit, the files determined to be similar by the duplicate file checking unit into groups and assigning each similar file group to a file server; and removing, by a deduplication unit, duplicate files within each similar file group based on the similarity determined by the duplicate file checking unit.

In the step of determining similarity, files are determined to be similar in any one of the following cases: the file names and file sizes are the same; the file names differ but the file sizes are the same; or the file size or file name differs but at least one block is identical.

In the step of assigning the similar file groups to file servers, at least one similar file group is assigned to each file server and stored there.

In the step of assigning the similar file groups to file servers, the similar file groups are assigned either sequentially or based on at least one of the following criteria: file extension, file type, file importance, and file usage frequency.

In the step of removing duplicate files, the files in excess of a preset number are removed from among the duplicate files in each similar file group.

In the step of removing duplicate files, when at least one block is identical among the similar files in a similar file group, the identical blocks in excess of a preset number are removed.

The effects obtained by representative aspects of the disclosed invention are briefly described as follows.

The present invention departs from the approach of analyzing every file stored in the distributed file system and deduplicating only files that have the same name. Instead, files found to be similar through analysis of their extensions, content types, and the like are organized into similar groups and stored on the file server (that is, the storage medium) assigned to each group, and deduplication is performed per storage medium. This makes it possible to deduplicate identical files that have different names, and thereby to save storage space on the storage media efficiently.

FIG. 1 is a block diagram showing the structure of a distributed file system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the structure of a deduplication apparatus in a distributed file system according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the operating procedure of a deduplication apparatus in a distributed file system according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in many different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of its scope; the invention is defined only by the scope of the claims.

In describing the embodiments of the present invention, detailed descriptions of well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. The terms used below are defined in consideration of their functions in the embodiments and may vary according to the intention or custom of users and operators. Their definitions should therefore be based on the content of this specification as a whole.

Combinations of the blocks of the accompanying block diagrams and the steps of the flowchart may be carried out by computer program instructions. Because these instructions can be loaded onto a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, the instructions executed by that processor create means for performing the functions described in each block of the block diagram or each step of the flowchart. The instructions can also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so the instructions stored in that memory can produce an article of manufacture containing instruction means that perform the functions described in each block or step. The instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is performed on that equipment to create a computer-executed process, and the instructions that run on the equipment provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

Each block or step may also represent a module, segment, or portion of code that includes one or more executable instructions for executing the specified logical function(s). Note also that in some alternative embodiments the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or may sometimes be performed in reverse order, depending on the functions involved.

In an embodiment of the present invention, in a distributed file system that creates and manages replicas of each file across multiple storage spaces to improve data stability and performance, replicas of files created by different users are organized into similar file groups, each group is stored on the file server (that is, the storage medium) assigned to it, and deduplication is then performed per storage medium.

FIG. 1 is a block diagram showing the structure of a distributed file system according to an embodiment of the present invention.

Referring to FIG. 1, the distributed file system 100 may include a file list management unit 110, a file list database 120, a duplication processing unit 130, a content type database 140, a control unit 150, and a plurality of file servers, represented as a first file server 160 and a second file server 170 through an n-th file server 180.

The file list management unit 110 generates a list of the files stored on the first file server 160 through the n-th file server 180, and the file list database 120 stores the file list generated by the file list management unit 110. For example, the location of a file uploaded from a user terminal and stored on the first file server 160 can be represented in the form of a directory. The file list management unit 110 and the file list database 120 may also be installed on each of the first file server 160 through the n-th file server 180, which actually store the files, and be managed and operated by the control unit 150.

The file list management unit 110 may, for example, periodically update the list of files stored on the first file server 160 through the n-th file server 180 and finally synchronize with those file servers.

The duplication processing unit 130 may analyze the content type and other properties of the files stored on the first file server 160 through the n-th file server 180 and organize the files found to be similar into similar groups. The type information for each file, namely basic information such as the file name, file extension, file size, and file type, together with metadata information, may be stored in the content type database 140. The files placed in a similar group may be assigned to a storage medium, that is, to at least one of the first file server 160 through the n-th file server 180, and deduplication may then be performed on each file server assigned to a similar group. The duplication processing unit 130 is described in detail with reference to FIG. 2.

The control unit 150 controls the components of the distributed file system 100. Working with the file list management unit 110 and the duplication processing unit 130, it can control file listing, file checking, grouping of similar files, and related operations involving the plurality of file servers. In particular, it can control the system so that the files of each similar file group constructed by the duplication processing unit 130 are stored on the first file server 160 through the n-th file server 180 and so that deduplication is performed on the stored files or file blocks.

The first file server 160 may include a first file management unit 162 that records files to be stored and a first file database 164 that actually stores them. The second file server 170 may include a second file management unit 172 that records files to be stored and a second file database 174 that actually stores them. Likewise, the n-th file server 180 may include an n-th file management unit 182 that records files to be stored and an n-th file database 184 that actually stores them.

The plurality of file servers, represented as the first file server 160 and the second file server 170 through the n-th file server 180, may be built from various storage media (for example, solid state drives (SSD), hard disk drives (HDD), and optical discs), and the number of file servers may vary with the storage capacity, the number of users, the state of the system, and so on.

FIG. 2 is a block diagram showing the structure of a deduplication apparatus in a distributed file system according to an embodiment of the present invention.

Referring to FIG. 2, the duplication processing unit 130 may include a duplicate file checking unit 200, a similar file group construction unit 202, and a deduplication unit 204. The duplicate file checking unit 200 can determine whether files stored on the first file server 160 through the n-th file server 180 are similar based on each file's basic information, such as its file name, extension, file size, and file type, and on metadata information, which may include the basic information as well as attribute information such as the file's location and content, author information, rights and terms of use, and usage history. When the basic information and metadata information of each file are stored in the content type database 140, the type information for each file can be retrieved from the content type database 140 and used to check for similar files.

Similar files may include: files that have the same file name; files whose names differ but whose extension, file size, and file type are the same; files that share a name because one is a modified version of the other, but in which only the file size and some of the blocks are the same; and files whose names and sizes differ but that share at least one identical block. Depending on the implementation, the method for checking similar files may be set by the user or preset on the distributed file system 100 as at least one reference condition.
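The similarity conditions listed above can be sketched as a small predicate. This is a simplified illustration, not the patented implementation: the `FileInfo` record, its field names, and the representation of blocks as a set of fingerprints are assumptions, and the modified-file case is folded into the same-name and shared-block checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    name: str                        # file name without extension
    ext: str                         # file extension
    size: int                        # file size in bytes
    kind: str                        # file type, e.g. "video"
    blocks: frozenset = frozenset()  # assumed: fingerprints of the file's blocks


def is_similar(a: FileInfo, b: FileInfo) -> bool:
    # Case 1: the file names are the same.
    if a.name == b.name:
        return True
    # Case 2: names differ, but extension, size, and type all match.
    if (a.ext, a.size, a.kind) == (b.ext, b.size, b.kind):
        return True
    # Case 3: names and sizes differ, but at least one block is shared.
    return bool(a.blocks & b.blocks)


x = FileInfo("movie1", "avi", 700, "video", frozenset({"b1", "b2"}))
y = FileInfo("film_copy", "avi", 700, "video", frozenset({"b9"}))
z = FileInfo("notes", "txt", 12, "text", frozenset({"b2"}))
w = FileInfo("other", "txt", 99, "text", frozenset({"b7"}))
```

Here `x` and `y` match on extension, size, and type, `x` and `z` share block `b2`, and `y` and `w` have nothing in common.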

The similar file group construction unit 202 may form at least two files determined to be duplicates by the duplicate file checking unit 200 into a similar file group and assign each similar file group to a file server.

Groups are formed according to at least one of the following criteria: file extension, file type, file importance, and file usage frequency. The groups may also be organized hierarchically. For example, a similar movie file group may contain a similar AVI file group and a similar MKV file group, and the similar AVI file group may in turn contain an AVI file group for similar "movie name 1" files and an AVI file group for similar "movie name 2" files.
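The hierarchical grouping in the movie example can be pictured as a nested mapping from file type to extension to title. The following is a minimal sketch; the record fields (`kind`, `ext`, `title`, `path`), the sample paths, and the function name `build_groups` are hypothetical and serve only to illustrate the layering.

```python
from collections import defaultdict


def build_groups(files):
    """Group file paths as file type -> extension -> title."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for f in files:
        tree[f["kind"]][f["ext"]][f["title"]].append(f["path"])
    return tree


files = [
    {"kind": "movie", "ext": "avi", "title": "movie name 1", "path": "u1/a.avi"},
    {"kind": "movie", "ext": "avi", "title": "movie name 1", "path": "u2/copy.avi"},
    {"kind": "movie", "ext": "avi", "title": "movie name 2", "path": "u1/b.avi"},
    {"kind": "movie", "ext": "mkv", "title": "movie name 1", "path": "u3/a.mkv"},
]

groups = build_groups(files)
```

In this sketch, `groups["movie"]["avi"]["movie name 1"]` holds the two AVI copies of the same title, which are the candidates a later deduplication pass would compare.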

In other words, the files in a similar file group share at least one identical block with one another, so forming them into a similar file group saves time in the later deduplication.

File servers are assigned to similar file groups either sequentially, in the order the groups are generated, or according to a criterion set to at least one of the following: file extension, file type, file importance, and file usage frequency.

For example, when three similar file groups have been formed, the groups can be distributed across and stored on the first file server 160 through the n-th file server 180, after which deduplication can be performed.
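The two allocation approaches, sequential and criterion-based, might be sketched as follows. The group names, server names, and lookup table are hypothetical, and the criterion shown (file extension) is just one of the criteria the text allows.

```python
from itertools import cycle


def assign_sequentially(groups, servers):
    """Give each group the next server in order, wrapping around if needed."""
    return dict(zip(groups, cycle(servers)))


def assign_by_extension(groups, server_for_ext, default_server):
    """Pick a server from a lookup table keyed by the group's file extension."""
    return {name: server_for_ext.get(ext, default_server)
            for name, ext in groups.items()}


# Three hypothetical similar file groups, keyed by name with their extension.
groups = {"similar-avi": "avi", "similar-mkv": "mkv", "similar-doc": "doc"}
servers = ["file-server-1", "file-server-2", "file-server-3"]

sequential = assign_sequentially(groups, servers)
by_extension = assign_by_extension(
    groups, {"avi": "file-server-1", "mkv": "file-server-1"}, "file-server-3")
```

Sequential assignment spreads the three groups over three servers, while the extension table places both video groups on the same server so they can be deduplicated together.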

The deduplication unit 204 can deduplicate the similar files on each file server assigned to a similar file group. For files with the same extension, when the file names and file sizes are the same, when the file names differ but the file sizes are the same, or when the file size or file name differs but at least one block is identical, the files or blocks that are identical in excess of a preset number are removed. Because files may have been deliberately replicated and distributed across the distributed file system 100, the preset number of replicas can be checked through the control unit 150, and unnecessary files or blocks beyond that number can be removed.
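Removing only the copies beyond a preset replica count could look like the sketch below. It assumes duplicates within a group have already been identified and are represented by a shared content digest; the digest strings, the paths, and the function name `files_to_remove` are illustrative assumptions rather than details from the specification.

```python
from collections import defaultdict


def files_to_remove(group, replica_count):
    """Return paths of duplicate files beyond the allowed number of copies."""
    by_content = defaultdict(list)
    for path, digest in group:
        by_content[digest].append(path)
    extra = []
    for paths in by_content.values():
        # Keep the first replica_count copies; everything after is surplus.
        extra.extend(paths[replica_count:])
    return extra


# One similar file group on one file server: (path, assumed content digest).
group = [
    ("s1/report.doc", "d1"),
    ("s1/report_copy.doc", "d1"),
    ("s1/final.doc", "d1"),
    ("s1/other.doc", "d2"),
]

surplus = files_to_remove(group, replica_count=2)
```

With a replica count of 2, two of the three identical files are kept as intended replicas and only the third is marked for removal.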

FIG. 3 is a flowchart showing the operating procedure of a deduplication apparatus in a distributed file system according to an embodiment of the present invention.

Referring to FIG. 3, in step 300 the duplicate file checking unit 200 of the duplication processing unit 130 checks whether each of the files stored on the first file server 160 through the n-th file server 180 is a duplicate, determining the similarity between files based on the type information for each file stored in the content type database 140.

For example, similarity between files is basically judged, for files with the same extension, by whether the file names and file sizes are the same, whether the file names differ but the file sizes are the same, or whether the file size or file name differs but at least one block is identical.

In step 302, the similar file group construction unit 202 forms two or more files determined to be similar by the duplicate file checking unit 200 into a group. In step 304, the similar file group construction unit 202 assigns a file server to each similar file group and stores the group there. File servers are assigned either sequentially, in the order the groups are generated, or according to at least one of the following criteria: file extension, file type, file importance, and file usage frequency.

In step 306, the deduplication unit 204 removes the duplicate files of each similar file group on a per-file-server basis, removing the corresponding number of files or blocks according to preset criteria.
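Steps 300 through 306 can be strung together in a compact sketch. This is a deliberately simplified illustration under stated assumptions: files count as similar when their extension and size match (only one of the criteria in the text), groups are allocated to servers sequentially, and every file in a group beyond `replica_count` is treated as a removable duplicate. The function name, record fields, and server names are hypothetical.

```python
from collections import defaultdict
from itertools import cycle


def run_deduplication(files, servers, replica_count=1):
    # Steps 300 and 302: treat files with the same extension and size as
    # similar and collect them into groups (a simplified criterion).
    groups = defaultdict(list)
    for f in files:
        groups[(f["ext"], f["size"])].append(f)

    # Step 304: allocate a file server to each group sequentially.
    placement = dict(zip(groups, cycle(servers)))

    # Step 306: per server, keep at most replica_count files of each group.
    kept, removed = defaultdict(list), []
    for key, members in groups.items():
        kept[placement[key]].extend(f["name"] for f in members[:replica_count])
        removed.extend(f["name"] for f in members[replica_count:])
    return dict(kept), removed


files = [
    {"name": "a.avi", "ext": "avi", "size": 700},
    {"name": "b.avi", "ext": "avi", "size": 700},
    {"name": "c.txt", "ext": "txt", "size": 10},
]

kept, removed = run_deduplication(files, ["file-server-1", "file-server-2"])
```

Because each group lives on one server, the final step only compares files within that server instead of scanning the whole system, which is the time saving the embodiment aims for.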

As described above, in an embodiment of the present invention, in a distributed file system that creates and manages replicas of each file across multiple storage spaces to improve data stability and performance, the content types of files created by different users are checked, the files are organized into similar file groups, each group is stored on the file server (that is, the storage medium) assigned to it, and deduplication is then performed per storage medium.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims set out below and their equivalents.

As described above, the deduplication apparatus and method in a distributed file system according to the present invention depart from the conventional approach. In that approach, a commodity-hardware-based distributed file system keeps at least one replica of each original file on a disk or computer other than the one holding the original, in order to mitigate the data instability and performance degradation that can result from using relatively inexpensive disk media, and deduplication is performed only on files with the same name, by analyzing all files stored in the system.

Instead, the present invention analyzes attributes such as the content type of the files in the distributed file system, organizes the files found to be similar into similar groups, stores each group on the storage medium (such as a file server) assigned to that group, and performs deduplication for each storage medium. This makes it possible to deduplicate files that are identical but have different names, and is therefore suited to reducing the storage space used on the storage media.

100: distributed file system    110: file list management unit
120: file list database    130: duplicate processing unit
140: content type database    160: control unit
160: first file server    162: first file management unit
164: first file database    170: second file server
172: second file management unit    174: second file database
180: n-th file server    182: n-th file management unit
184: n-th file database

Claims

A deduplication apparatus in a distributed file system, comprising:
a duplicate file checking unit that determines similarity between files stored across a plurality of file servers, based on type information of the files;
a similar file group construction unit that forms groups of the files determined to be similar and assigns the similar files belonging to the same similar file group to the same file server; and
a deduplication unit that determines, for each file server, whether duplicate files or duplicate file blocks exist in each similar file group, based on the similarity determined by the duplicate file checking unit.

The apparatus according to claim 1,
wherein the type information includes
a file name, a file extension, a file size, a file type, and metadata information.

The apparatus according to claim 1,
wherein the duplicate file checking unit
determines files to be similar when the file name and the file size are the same, when the file names differ but the file size is the same, or when the file size or the file name differs but at least one block is the same.

A deduplication apparatus in a distributed file system, comprising:
a duplicate file checking unit that determines similarity between files stored across a plurality of file servers, based on type information of the files;
a similar file group construction unit that forms a group of the files determined to be similar and assigns the generated similar file group to a file server; and
a deduplication unit that removes duplicate files in each similar file group based on the similarity determined by the duplicate file checking unit,
wherein the similar file group construction unit
forms the group according to at least one criterion among file extension, file type, file importance, and file usage frequency.

A deduplication apparatus in a distributed file system, comprising:
a duplicate file checking unit that determines similarity between files stored across a plurality of file servers, based on type information of the files;
a similar file group construction unit that forms a group of the files determined to be similar and assigns the generated similar file group to a file server; and
a deduplication unit that removes duplicate files in each similar file group based on the similarity determined by the duplicate file checking unit,
wherein the similar file group construction unit
allocates at least one similar file group to each file server and stores it there.

A deduplication apparatus in a distributed file system, comprising:
a duplicate file checking unit that determines similarity between files stored across a plurality of file servers, based on type information of the files;
a similar file group construction unit that forms a group of the files determined to be similar and assigns the generated similar file group to a file server; and
a deduplication unit that removes duplicate files in each similar file group based on the similarity determined by the duplicate file checking unit,
wherein the deduplication unit
removes the duplicate files in excess of a preset number in each similar file group.

A deduplication apparatus in a distributed file system, comprising:
a duplicate file checking unit that determines similarity between files stored across a plurality of file servers, based on type information of the files;
a similar file group construction unit that forms a group of the files determined to be similar and assigns the generated similar file group to a file server; and
a deduplication unit that removes duplicate files in each similar file group based on the similarity determined by the duplicate file checking unit,
wherein the deduplication unit,
when at least one block is identical among the similar files in a similar file group, removes the identical blocks in excess of a preset number.

A deduplication method in a distributed file system, comprising:
determining, by a duplicate file checking unit, similarity based on type information of each file distributed and stored across a plurality of file servers;
forming, by a similar file group construction unit, groups of the files determined to be similar by the duplicate file checking unit, and assigning the similar files belonging to the same similar file group to the same file server; and
determining, by a deduplication unit, for each file server, whether duplicate files or duplicate file blocks exist in each similar file group, based on the similarity determined by the duplicate file checking unit.

The method according to claim 8,
wherein the type information includes
a file name, a file extension, a file size, a file type, and metadata information.

The method according to claim 8,
wherein determining the similarity comprises
determining files to be similar when the file name and the file size are the same, when the file names differ but the file size is the same, or when the file size or the file name differs but at least one block is the same.

A deduplication method in a distributed file system, comprising:
determining, by a duplicate file checking unit, similarity based on type information of each file distributed and stored across a plurality of file servers;
forming, by a similar file group construction unit, a group of the files determined to be similar by the duplicate file checking unit, and assigning the generated similar file group to a file server; and
removing, by a deduplication unit, duplicate files in each similar file group based on the similarity determined by the duplicate file checking unit,
wherein assigning the generated similar file group to the file server comprises
allocating at least one similar file group to each file server and storing it there.

A deduplication method in a distributed file system, comprising:
determining, by a duplicate file checking unit, similarity based on type information of each file distributed and stored across a plurality of file servers;
forming, by a similar file group construction unit, a group of the files determined to be similar by the duplicate file checking unit, and assigning the generated similar file group to a file server; and
removing, by a deduplication unit, duplicate files in each similar file group based on the similarity determined by the duplicate file checking unit,
wherein assigning the generated similar file group to the file server comprises
allocating the similar file groups sequentially, or allocating them based on at least one of file extension, file type, file importance, and file usage frequency.

A deduplication method in a distributed file system, comprising:
determining, by a duplicate file checking unit, similarity based on type information of each file distributed and stored across a plurality of file servers;
forming, by a similar file group construction unit, a group of the files determined to be similar by the duplicate file checking unit, and assigning the generated similar file group to a file server; and
removing, by a deduplication unit, duplicate files in each similar file group based on the similarity determined by the duplicate file checking unit,
wherein removing the duplicate files comprises
removing the duplicate files in excess of a preset number in each similar file group.

A deduplication method in a distributed file system, comprising:
determining, by a duplicate file checking unit, similarity based on type information of each file distributed and stored across a plurality of file servers;
forming, by a similar file group construction unit, a group of the files determined to be similar by the duplicate file checking unit, and assigning the generated similar file group to a file server; and
removing, by a deduplication unit, duplicate files in each similar file group based on the similarity determined by the duplicate file checking unit,
wherein removing the duplicate files comprises,
when at least one block is identical among the similar files in a similar file group, removing the identical blocks in excess of a preset number.