CN103617177A - Stacked data deduplication file system

Info

Publication number: CN103617177A
Application number: CN201310541623.5A
Authority: CN
Inventors: 王恩东; 文中领; 张立强; 孟圣智
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2013-11-05
Filing date: 2013-11-05
Publication date: 2014-03-05
Also published as: WO2015067128A1

Abstract

A stacked data deduplication file system is proposed. It comprises a file system service module and a deduplication service module. For normal data, the file system service module imports the data of the underlying file system into this file system by direct interface conversion; for deduplicated data, it reads the corresponding data attribute identifier and redirects the IO flow, giving transparent, seamless access to the data after deduplication. The deduplication service module reads the file system log data exported by the file system service module, parses the log content, computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete. The system makes full use of the storage capacity of an existing storage system and needs no hardware upgrade, which saves investment. Through its stacked software design, it provides deduplication on top of an existing file system, optimizes the data storage structure, and reduces the space occupied by the storage system.

Description

A Stacked Deduplication File System

Technical field

The invention relates to the field of computer storage, and in particular to a data deduplication file system built on stacked file system technology.

Background

In large-scale storage systems, the conflict between rapid data growth and the relatively slow pace of storage hardware upgrades is acute. To ease the space growth problem, shrink the space occupied by data, cut costs, and make the most of existing resources, data deduplication has become an indispensable key technology in large-scale systems.

With data deduplication, users obtain a clear data reduction effect, which can greatly reduce the bandwidth requirements of the storage system and lower operating and maintenance costs. Data reduction also greatly shrinks the storage capacity actually needed at the back end, which simplifies storage management and lowers management costs.

However, most currently popular deduplication solutions target nearline storage and backup storage, and they are often tightly coupled to a backup system, so they cannot provide general file system services. Few products offer deduplication directly in an online system, and all of them require a proprietary file system format. These proprietary file systems often have many limitations in performance, functionality, reliability, and scalability, which makes them hard to apply directly in large-scale online storage systems.

Existing large-scale storage systems are usually built on mature file systems such as ext3, ext4, xfs, and lustre. These file systems have no deduplication capability of their own. Adding deduplication means adopting a proprietary file system, accepting a clearly perceptible performance loss, and carrying out a large-scale data migration. The resulting time and space costs are extremely high, so for a storage system that already holds a large amount of data this approach is essentially infeasible.

In view of this situation, the present invention designs a stacked data deduplication file system that provides deduplication on top of an existing, mature file system, fully preserves the performance of the original storage system, and requires almost no data migration.

Summary of the invention

The present invention designs and implements a stacked data deduplication file system. It makes full use of the storage capacity of an existing storage system and needs no hardware upgrade, which saves investment. Through its stacked software design, it provides deduplication on top of an existing file system, optimizes the data storage structure, and reduces the space occupied by the storage system.

The system includes:

A file system service module. For normal data, it imports the data of the underlying file system into this file system by direct interface conversion. For data that has been deduplicated, it reads the corresponding data attribute identifier and redirects the IO flow, giving transparent, seamless access to the data after deduplication.

A deduplication service module. It reads the file system log data exported by the file system service module, parses the log content, then computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete.

The beneficial effects of the present invention are as follows. The stacked file system design makes full use of the existing storage system: installing the software system described in this patent is enough for an existing file system to support deduplication and save storage space. No data migration is needed, and the IO performance of the original storage system is preserved, so existing equipment is fully reused and the investment is protected.

Description of drawings

Figure 1 is a schematic diagram of the architecture of the stacked data deduplication file system proposed in this patent.

Detailed description

With reference to Figure 1, the following describes, by way of a specific example, how this architecture is implemented.

As described in the summary of the invention, the architecture of the present invention mainly comprises a file system service module and a deduplication service module.

The file system service module implements a file system with full POSIX support. It follows the stacked file system design strategy: by mapping and rewriting calls at the file system interface layer, it reproduces the services of the underlying file system in full. For normal data, this module imports the data of the underlying file system into this file system by direct interface conversion, so normal data is accessed seamlessly. For data that has been deduplicated, this module follows the conventions of the file system described in the present invention: it reads the corresponding data attribute identifier and redirects the IO flow, giving transparent, seamless access to the data after deduplication.
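The read path described above can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the class name `StackedReadLayer`, the attribute name `user.dedup.signature`, and the use of dictionaries in place of a real underlying file system are all assumptions made for illustration. The patent says only that an attribute identifier is read and the IO flow is redirected.

```python
class StackedReadLayer:
    """Toy model of the stacked read path; every name here is hypothetical."""

    # Assumed attribute name; the patent does not specify one.
    DEDUP_ATTR = "user.dedup.signature"

    def __init__(self, lower_files, lower_attrs, shared_store):
        self.lower_files = lower_files    # path -> bytes held by the underlying file system
        self.lower_attrs = lower_attrs    # path -> {attribute name: value}
        self.shared_store = shared_store  # signature -> bytes of the single retained copy

    def read(self, path):
        marker = self.lower_attrs.get(path, {}).get(self.DEDUP_ATTR)
        if marker is None:
            # Normal data: pass the call straight through to the lower layer.
            return self.lower_files[path]
        # Deduplicated data: redirect the IO to the retained copy.
        return self.shared_store[marker]


lower_files = {"/a.txt": b"hello", "/b.txt": b""}
lower_attrs = {"/b.txt": {StackedReadLayer.DEDUP_ATTR: "sig-1"}}
shared_store = {"sig-1": b"hello"}
fs = StackedReadLayer(lower_files, lower_attrs, shared_store)
```

In this sketch a caller reading `/b.txt` gets the same bytes as a caller reading `/a.txt`, without knowing that the payload of `/b.txt` was removed. That is the sense in which access is transparent.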

The deduplication service module runs independently, out of band. It uses a multithreaded design that exploits the parallel computing capability of multi-core systems to deduplicate data at high speed. This module reads the file system log data exported by the file system service module, parses the log content, then computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete. This module can run at the same time as the file system service module. Fine-grained locks designed into the file system service module guarantee the atomicity of data processing and provide reliable parallel data processing.
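The out-of-band pipeline (parse the log, compute signatures, detect duplicates, mark them) can be sketched as follows. This is a simplified illustration under several assumptions: the log format `WRITE <path>`, the choice of SHA-256 as the signature, whole-file granularity, and the function names are all hypothetical and not taken from the patent. The fine-grained locking against concurrent file system IO is not modeled; threads are used only to compute signatures in parallel.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def signature(data: bytes) -> str:
    # SHA-256 is an assumption; the patent does not name a signature algorithm.
    return hashlib.sha256(data).hexdigest()


def deduplicate(log_lines, files, attrs, workers=4):
    """Process a log of written paths; return the number of duplicates removed."""
    # Assumed log format: one "WRITE <path>" record per line.
    paths = [line.split(" ", 1)[1] for line in log_lines if line.startswith("WRITE ")]
    paths = list(dict.fromkeys(paths))  # drop repeated records, keep first-seen order

    # Signature calculation is the parallel step.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sigs = list(pool.map(lambda p: signature(files[p]), paths))

    index = {}   # signature -> path of the retained copy
    removed = 0
    for path, sig in zip(paths, sigs):
        owner = index.setdefault(sig, path)
        if owner != path:
            files[path] = b""    # delete the duplicate payload
            attrs[path] = owner  # mark the file so reads can be redirected
            removed += 1
    return removed


files = {"/a": b"x" * 10, "/b": b"x" * 10, "/c": b"y"}
attrs = {}
log = ["WRITE /a", "WRITE /b", "WRITE /c", "WRITE /a"]
removed = deduplicate(log, files, attrs)
```

Here `/a` and `/b` have identical content, so the second one seen is emptied and marked as pointing at the first. A production design would more likely work on blocks or chunks than on whole files, and would have to coordinate with in-flight writes.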

In a typical configuration, the file system service module and the deduplication service module are installed on the host system as ordinary application software. Once the software has been configured, both modules can be started. At that point the file system described in the present invention can be mounted on the host and data can be accessed. After a period of file system IO has completed, the deduplication service module automatically computes data signatures, detects and deletes duplicate data according to the configuration parameters, and marks the deduplicated data.

At this point the entire stacked data deduplication file system is fully implemented. It provides a high-performance deduplication service on top of an existing file system, greatly improves the space utilization of the storage system, and effectively protects the customer's investment.

Of course, the present invention may have many other embodiments. Those skilled in the art may make various corresponding changes and modifications according to the present invention without departing from its spirit and essence, and all such changes and modifications fall within the protection scope of the claims of the present invention.

Claims

1. A stacked data deduplication file system, characterized in that it comprises:

a file system service module which, for normal data, imports the data of the underlying file system into this file system by direct interface conversion and, for data that has been deduplicated, reads the corresponding data attribute identifier and redirects the IO flow, giving transparent, seamless access to the data after deduplication;

a deduplication service module which reads the file system log data exported by the file system service module, parses the log content, then computes data signatures, detects and deletes duplicate data, and marks the data once deduplication is complete.