CN103617177A - Stackable repeating data deletion file system - Google Patents

Publication number
CN103617177A
Authority
CN
China
Prior art keywords
data
file system
deduplication
service module
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310541623.5A
Other languages
Chinese (zh)
Inventor
王恩东
文中领
张立强
孟圣智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN201310541623.5A
Publication of CN103617177A
Priority to PCT/CN2014/089303 (WO2015067128A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

A stacked deduplication file system is proposed. It comprises a file system service module and a deduplication service module. For normal data, the file system service module imports the data of the underlying file system into this file system by direct interface conversion; for deduplicated data, it reads the corresponding data attribute identifier and redirects the I/O flow, providing transparent and seamless access to deduplicated data. The deduplication service module reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete. The system makes full use of the storage capacity of an existing storage system and requires no hardware upgrade, which saves investment. Through its stacked software design it provides deduplication on top of an existing file system, optimizes the data storage structure, and reduces the space the storage system occupies.

Description

A Stacked Deduplication File System

Technical Field

The invention relates to the field of computer storage, and in particular to a deduplication file system implemented with stacked file system technology.

Background

In large storage systems, the tension between rapid data growth and the comparatively slow pace of storage hardware upgrades is acute. To relieve space growth, shrink the space occupied by data, reduce cost, and make the most of existing resources, deduplication has become an indispensable key technology in large systems.

With deduplication, users obtain a marked data reduction, which can greatly lower the storage system's bandwidth requirements as well as its operating and maintenance costs. Data reduction greatly shrinks the actual back-end storage capacity, which in turn simplifies storage management and lowers management cost.

However, most currently popular deduplication solutions target nearline and backup storage and are often tightly coupled to a backup system, so they cannot provide general-purpose file system services. Few products offer deduplication directly in an online system, and all of them require a proprietary file system format. These proprietary file systems tend to have many limitations in performance, functionality, reliability, and scalability, which makes them difficult to apply directly in large online storage systems.

Existing large storage systems are usually built on mature file systems such as ext3, ext4, xfs, and lustre, which do not provide deduplication themselves. Adding deduplication means adopting a proprietary file system, tolerating a clearly perceptible performance loss, and carrying out a large-scale data migration. The resulting time and space costs are extremely high, so for a storage system that already holds a large amount of data this approach is essentially infeasible.

In view of this situation, the invention provides a stacked deduplication file system that offers deduplication on top of an existing mature file system, fully preserves the performance of the original storage system, and requires almost no data migration.

Summary of the Invention

The invention designs and implements a stacked deduplication file system that makes full use of the storage capacity of an existing storage system and requires no hardware upgrade, which saves investment. Through its stacked software design it provides deduplication on top of an existing file system, optimizes the data storage structure, and reduces the space the storage system occupies.

The system includes:

A file system service module. For normal data, it imports the data of the underlying file system into this file system by direct interface conversion. For deduplicated data, it reads the corresponding data attribute identifier and redirects the I/O flow, providing transparent and seamless access to deduplicated data.

A deduplication service module. It reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete.

The beneficial effects of the invention are as follows. Because the design is based on a stacked file system, it can make full use of an existing storage system: installing the software system described in this patent is enough to make the existing file system support deduplication and save storage space. No data migration is needed, the I/O performance of the original storage system is preserved, and existing equipment and investment are fully reused and protected.

Description of the Drawings

Figure 1 is a schematic diagram of the architecture of the stacked deduplication file system proposed in this patent.

Detailed Description

With reference to Figure 1, the following describes, by way of a specific example, how this architecture is implemented.

As described in the Summary of the Invention, the architecture mainly comprises a file system service module and a deduplication service module.

The file system service module implements a file system with full POSIX support. It follows the stacked file system design strategy: by mapping and rewriting calls at the file system interface layer, it exposes the complete set of services of the underlying file system. For normal data, the module imports the data of the underlying file system into this file system by direct interface conversion, so normal data is accessed seamlessly. For deduplicated data, the module follows the conventions of the file system described in this invention: it reads the corresponding data attribute identifier and redirects the I/O flow, providing transparent and seamless access to deduplicated data.
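The pass-through and redirection behaviour described above can be illustrated with a minimal sketch. The patent does not specify how the data attribute identifier is stored, so the sketch assumes a hypothetical convention: a deduplicated file is marked by a sidecar file with the suffix `.dedup` whose content is the path of the shared copy. The class `StackedFS` and the name `MARKER_SUFFIX` are illustrative only; a real implementation would hook the file system interface layer and might use extended attributes instead.

```python
import os
from typing import Optional

# Hypothetical convention (not from the patent): a deduplicated file is
# marked by a sidecar file "<name>.dedup" holding the path of the shared copy.
MARKER_SUFFIX = ".dedup"


class StackedFS:
    """Toy stacked layer over a directory of the underlying file system."""

    def __init__(self, lower_root: str) -> None:
        self.lower_root = lower_root

    def _lower_path(self, name: str) -> str:
        return os.path.join(self.lower_root, name)

    def _redirect_target(self, name: str) -> Optional[str]:
        """Return the shared-copy path if the file is marked, else None."""
        marker = self._lower_path(name) + MARKER_SUFFIX
        if not os.path.exists(marker):
            return None
        with open(marker, "r", encoding="utf-8") as f:
            return f.read().strip()

    def read(self, name: str) -> bytes:
        target = self._redirect_target(name)
        # Normal data: pass straight through to the underlying file system.
        # Deduplicated data: redirect the read to the shared copy.
        path = target if target is not None else self._lower_path(name)
        with open(path, "rb") as f:
            return f.read()
```

The caller uses the same `read` call in both cases, which is the sense in which access to deduplicated data is transparent.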

The deduplication service module runs independently, out of band. It is multi-threaded so that it can exploit the parallel computing capability of multi-core systems and deduplicate at high speed. The module reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete. It can run at the same time as the file system service module: fine-grained locks designed into the file system service module guarantee that data processing is atomic and provide reliable parallel data processing.
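As an illustration of this pipeline, the sketch below parses a log, computes signatures in a thread pool, and reports duplicates. Several details are assumptions not stated in the patent: the log format (`WRITE <path>` records), whole-file granularity, and SHA-256 as the signature. The function names are hypothetical. Deleting the duplicates, marking them, and the fine-grained locking against concurrent I/O are omitted.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def parse_log(lines):
    """Return paths named in WRITE records, in first-seen order.

    The "WRITE <path>" record format is an assumption for this sketch.
    """
    paths = []
    seen = set()
    for line in lines:
        parts = line.strip().split(" ", 1)
        if len(parts) == 2 and parts[0] == "WRITE" and parts[1] not in seen:
            seen.add(parts[1])
            paths.append(parts[1])
    return paths


def signature(path):
    """Compute a whole-file SHA-256 signature, reading in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths, workers=4):
    """Map each duplicate path to the first path with the same signature."""
    # Signatures are computed in parallel; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sigs = list(pool.map(signature, paths))
    first_seen = {}
    duplicates = {}
    for path, sig in zip(paths, sigs):
        if sig in first_seen:
            duplicates[path] = first_seen[sig]
        else:
            first_seen[sig] = path
    return duplicates
```

Each entry in the result names a file whose content could be replaced by a reference to the earlier file, which is the point at which a real service would delete the duplicate and write the attribute identifier.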

In a typical configuration, the file system service module and the deduplication service module are installed on the host system as ordinary application software. Once the software has been configured, both modules can be started; at that point the file system described in this invention can be mounted on the host and data can be accessed. After file system I/O has been running for some time, the deduplication service module automatically computes data signatures, detects and deletes duplicate data according to the configuration parameters, and marks the deduplicated data.
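The patent does not list the configuration parameters. Purely as an illustration, the sketch below assumes two hypothetical ones, a minimum file size and a minimum idle time since the last write, and shows how the deduplication service might use them to decide whether a file is a candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupConfig:
    # Both parameters are hypothetical examples, not taken from the patent.
    min_size_bytes: int = 4096       # skip files smaller than this
    min_idle_seconds: float = 300.0  # skip files written too recently


def eligible(size_bytes, last_write_time, now, cfg):
    """Return True if a file qualifies for deduplication under cfg."""
    if size_bytes < cfg.min_size_bytes:
        return False
    return (now - last_write_time) >= cfg.min_idle_seconds
```

Filtering on idle time is one way to keep the out-of-band service away from files that are still being actively written.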

At this point the stacked deduplication file system is fully implemented. It provides a high-performance deduplication service on top of an existing file system, greatly improves the space utilization of the storage system, and effectively protects the customer's investment.

Of course, the invention may have many other embodiments. Those skilled in the art may make corresponding changes and variations based on the invention without departing from its spirit and essence, and all such changes and variations fall within the scope of protection of the claims of the invention.

Claims (1)

1. A stacked deduplication file system, characterized by comprising:
a file system service module which, for normal data, imports the data of the underlying file system into this file system by direct interface conversion, and which, for deduplicated data, reads the corresponding data attribute identifier and redirects the I/O flow, providing transparent and seamless access to deduplicated data; and
a deduplication service module which reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310541623.5A CN103617177A (en) 2013-11-05 2013-11-05 Stackable repeating data deletion file system
PCT/CN2014/089303 WO2015067128A1 (en) 2013-11-05 2014-10-23 Stackable data duplication file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310541623.5A CN103617177A (en) 2013-11-05 2013-11-05 Stackable repeating data deletion file system

Publications (1)

Publication Number Publication Date
CN103617177A true CN103617177A (en) 2014-03-05

Family

ID=50167880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310541623.5A Pending CN103617177A (en) 2013-11-05 2013-11-05 Stackable repeating data deletion file system

Country Status (2)

Country Link
CN (1) CN103617177A (en)
WO (1) WO2015067128A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100082700A1 (en) * 2008-09-22 2010-04-01 Riverbed Technology, Inc. Storage system for data virtualization and deduplication
US20100082547A1 (en) * 2008-09-22 2010-04-01 Riverbed Technology, Inc. Log Structured Content Addressable Deduplicating Storage
CN101908073A (en) * 2010-08-13 2010-12-08 清华大学 A method for real-time deletion of duplicate data in a file system
CN103051671A (en) * 2012-11-22 2013-04-17 浪潮电子信息产业股份有限公司 Repeating data deletion method for cluster file system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0104227D0 (en) * 2001-02-21 2001-04-11 Ibm Information component based data storage and management
CN103279502B (en) * 2013-05-06 2016-01-20 北京赛思信安技术有限公司 A kind of framework and method with the data de-duplication file system be combined with parallel file system
CN103617177A (en) * 2013-11-05 2014-03-05 浪潮(北京)电子信息产业有限公司 Stackable repeating data deletion file system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015067128A1 (en) * 2013-11-05 2015-05-14 浪潮(北京)电子信息产业有限公司 Stackable data duplication file system
CN104133888A (en) * 2014-07-30 2014-11-05 宇龙计算机通信科技(深圳)有限公司 Multi-system data processing method, device and terminal
CN104133888B (en) * 2014-07-30 2019-08-02 宇龙计算机通信科技(深圳)有限公司 A kind of multisystem data processing method, device and terminal
CN104391915A (en) * 2014-11-19 2015-03-04 湖南国科微电子有限公司 Duplicated data delete method
CN104391915B (en) * 2014-11-19 2016-02-24 湖南国科微电子股份有限公司 A kind of data heavily delete method
CN105205094A (en) * 2015-08-12 2015-12-30 浪潮(北京)电子信息产业有限公司 Multi-control share storage system

Also Published As

Publication number Publication date
WO2015067128A1 (en) 2015-05-14

Similar Documents

Publication Publication Date Title
CN102629258B (en) Repeating data deleting method and device
CN102662992B (en) Method and device for storing and accessing massive small files
US9729659B2 (en) Caching content addressable data chunks for storage virtualization
CN103229173B (en) Metadata management method and system
CN104145468B (en) Method and device for controlling file access authority
CN101866359B (en) Small file storage and visit method in avicade file system
CN105183839A (en) Hadoop-based storage optimizing method for small file hierachical indexing
CN104462185B (en) A kind of digital library's cloud storage system based on mixed structure
CN106909651A (en) A kind of method for being write based on HDFS small documents and being read
CN103561101A (en) Network file system
CN105487818A (en) Efficient duplicate removal method for repeated redundant data in cloud storage system
CN103020174A (en) Similarity analysis method, device and system
CN103279502B (en) A kind of framework and method with the data de-duplication file system be combined with parallel file system
CN103034684A (en) Optimizing method for storing virtual machine mirror images based on CAS (content addressable storage)
CN103078898B (en) File system, interface service device and data storage service supplying method
CN105630810B (en) A method of mass small documents are uploaded in distributed memory system
CN104778229A (en) Telecommunication service small file storage system and method based on Hadoop
CN103595799A (en) Method for achieving distributed shared data bank
CN103617177A (en) Stackable repeating data deletion file system
WO2021082928A1 (en) Data reduction method and apparatus, computing device, and storage medium
CN102722450B (en) Storage method for redundancy deletion block device based on location-sensitive hash
CN105516313A (en) Distributed storage system used for big data
CN102566942A (en) File striping writing method, device and system
CN103984507A (en) Storage configuration and optimizing strategy for bioinformatics high-performance computing platform
CN103543959B (en) The method and device of mass data cache

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140305