WO2012171244A1 - Method and system for implementing deduplication on a block-level virtualized storage device - Google Patents

Method and system for implementing deduplication on a block-level virtualized storage device

Info

Publication number
WO2012171244A1
WO2012171244A1 PCT/CN2011/077890 CN2011077890W WO2012171244A1 WO 2012171244 A1 WO2012171244 A1 WO 2012171244A1 CN 2011077890 W CN2011077890 W CN 2011077890W WO 2012171244 A1 WO2012171244 A1 WO 2012171244A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
deduplication
metadata
lba address
address space
Prior art date
Application number
PCT/CN2011/077890
Other languages
English (en)
French (fr)
Inventor
刘慧�
Original Assignee
北京卓微天成科技咨询有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京卓微天成科技咨询有限公司
Priority to US13/380,935 (US20120317084A1)
Publication of WO2012171244A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • G06F3/0689 Disk arrays, e.g. RAID, JBOD

Definitions

  • The present invention relates to the field of data storage technologies, and in particular to a method and system for implementing deduplication on a block-level virtualized storage device.
  • Background
  • Deduplication is increasingly important now that global data volumes double every 18 to 24 months and the data retention periods required by law have grown significantly. It is one of the main ways for companies to reduce storage overhead, and thereby reduce IT overhead and stay competitive.
  • Deduplication on traditional block-level storage devices has matured and is commercially available on a large scale.
  • A virtualized storage system adds a virtualization layer to the traditional storage architecture, forming a three-tier architecture with a host layer, a virtualization layer, and a physical storage device layer (such as JBOD or disk arrays).
  • The host layer and the physical storage device layer are exactly the same as in a traditional storage system.
  • The virtualization layer is a software layer (or a software function module embedded in hardware). It builds a unified storage device pool and provides virtual LUNs to front-end hosts for mounting and use.
  • When the deduplication function is implemented in the physical storage device layer, the storage virtualization layer serves as an intermediary, and all or some of the storage devices connected to it must themselves provide a deduplication function.
  • This method has the following limitations:
  • 1. The scope of deduplication is often limited to a specific storage device, so deduplication across the full range of data cannot be achieved, which reduces the overall deduplication ratio and effect.
  • 2. Data migration between heterogeneous storage devices requires a separate host to restore the data before migration, which hurts migration performance.
  • 3. Storage devices with deduplication use different metadata management mechanisms and strategies, which makes unified management of heterogeneous storage resources hard to implement.
  • Summary of the invention
  • The present invention proposes a method for implementing deduplication at the virtualization layer (rather than the host layer or the physical storage device layer) of a block-level virtualized storage device, the method comprising:
  • The method further includes setting a deduplication policy and a deduplication minimum data operation unit.
  • The step of deleting the duplicate data in the actual physical data corresponding to the specified virtual LBA address space includes:
  • dividing the specified-length data into data segments of a specified size according to the deduplication minimum data operation unit;
  • The step of obtaining the deduplicated data segment further comprises updating the metadata of the deduplicated data segment.
  • The deduplication minimum data operation unit is an integer multiple of a block, an integer multiple of a bit, or an integer multiple of a byte.
  • The block-level virtualized storage device uses an in-band or out-of-band architecture.
  • The present invention provides a system for implementing deduplication on a block-level virtualized storage device, the system comprising:
  • a virtual LUN device, provided for the front-end host to mount and use;
  • a deduplication module configured to delete duplicate data in the actual physical data corresponding to a specified virtual LBA address space and obtain the deduplicated data segment;
  • a global metadata management module configured to establish a correspondence between the virtual LBA address space and the deduplicated data segment, manage and update metadata in the global metadata pool device, obtain the storage location information of the actual physical data corresponding to a received virtual LBA address space according to that address space, the correspondence, and the metadata information of the deduplicated data segment, and send the storage location information;
  • a global metadata pool device configured to store the correspondence information established by the global metadata management module and the metadata information of the deduplicated data segment obtained by the deduplication module;
  • a storage virtualization module configured to send the virtual LBA address space of an external data read/write I/O request to the global metadata management module, receive from it the storage location information of the actual physical data corresponding to that virtual LBA address space, and complete the I/O redirection;
  • a physical LUN device that stores the actual physical data.
  • The deduplication module includes:
  • a setting unit configured to set a deduplication policy and a deduplication minimum data operation unit;
  • an obtaining unit configured to acquire the actual physical data storage location information corresponding to the specified virtual LBA address space;
  • an extracting unit configured to extract, from the physical LUN device, data of a specified length for deduplication, according to the actual physical data storage location information acquired by the obtaining unit and the deduplication minimum data operation unit set by the setting unit;
  • a dividing unit configured to divide, according to the deduplication policy set by the setting unit, the specified-length data extracted by the extracting unit into data segments of a specified size according to the deduplication minimum data operation unit set by the setting unit;
  • a data fingerprint library unit for storing data fingerprints;
  • a data deletion unit for calculating the data fingerprint of each data segment of the specified size divided by the dividing unit, comparing it with the data fingerprints stored by the data fingerprint library unit, and transmitting the comparison result;
  • a metadata management and update unit configured to receive the comparison result and, when the comparison shows that the data fingerprints are the same, send the content and request of the metadata update to the global metadata management module.
  • The deduplication minimum data operation unit is an integer multiple of a block, an integer multiple of a bit, or an integer multiple of a byte.
  • The present invention also provides a system for implementing deduplication on a block-level virtualized storage device, the system comprising:
  • a virtual LUN device, provided for the front-end host to mount and use;
  • a storage virtualization metadata pool device configured to store the metadata information corresponding to the virtual LBA address space, and a deduplication metadata pool device configured to store the metadata information of the data segments deduplicated by the deduplication module;
  • a deduplication module configured to delete duplicate data in the actual physical data corresponding to the specified virtual LBA address space, obtain the deduplicated data segment, and update the metadata information in the deduplication metadata pool device;
  • a global metadata management module configured to establish a correspondence between the virtual LBA address space and the deduplicated data segment, and to coordinate and synchronize metadata updates between the storage virtualization module and the deduplication module;
  • a storage virtualization module configured to obtain, according to the correspondence established by the global metadata management module and the metadata information of the data segment deduplicated by the deduplication module, the storage location information of the actual physical data corresponding to the virtual LBA address space of an external data read/write request, complete the I/O redirection, and update the metadata information in the storage virtualization metadata pool device;
  • The deduplication module includes:
  • a setting unit configured to set a deduplication policy and a deduplication minimum data operation unit;
  • an obtaining unit configured to acquire, from the physical LUN device, the actual physical data storage location information corresponding to the specified virtual LBA address space;
  • an extracting unit configured to extract, from the physical LUN device, data of a specified length for deduplication, according to the actual physical data storage location information acquired by the obtaining unit and the deduplication minimum data operation unit set by the setting unit;
  • a dividing unit configured to divide, according to the deduplication policy set by the setting unit, the specified-length data extracted by the extracting unit into data segments of a specified size according to the deduplication minimum data operation unit set by the setting unit;
  • a data fingerprint library unit for storing data fingerprints;
  • a data deletion unit for calculating the data fingerprint of each data segment of the specified size divided by the dividing unit, comparing it with the data fingerprints stored by the data fingerprint library unit, and transmitting the comparison result;
  • a metadata management and update unit configured to receive the comparison result and, when the comparison shows that the data fingerprints are the same, update the metadata of the deduplicated data segment under the coordination of the global metadata management module and send it to the deduplication metadata pool device.
  • The deduplication minimum data operation unit is an integer multiple of a block, an integer multiple of a bit, or an integer multiple of a byte.
  • 1. The technical solution provided by the present invention can delete duplicate data across hosts and storage devices, achieving deduplication over a wider range.
  • 2. The technical solution provided by the present invention does not occupy host system resources, thereby ensuring that the service programs running on the host can run smoothly.
  • 3. The technical solution provided by the present invention can centrally manage and protect the metadata of the deduplication function, and the entire system design and implementation.
  • FIG. 1 is a schematic structural diagram of a system for implementing deduplication on a block-level virtualized storage device according to Embodiment 1 of the present invention
  • FIG. 2 is a flow chart of a method for implementing deduplication on a block-level virtualized storage device according to an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a data deletion module according to Embodiment 1 of the present invention.
  • FIG. 4 is a schematic structural diagram of a system in which a deduplication module is not deployed according to Embodiment 1 of the present invention
  • FIG. 5 is a schematic structural diagram of a system in which no duplicate data has yet been deleted after the deduplication module is deployed, according to Embodiment 1 of the present invention
  • FIG. 6 is a schematic structural diagram of a system in which part of data has been deduplicated after deploying a data deduplication module according to Embodiment 1 of the present invention
  • FIG. 7 is a schematic diagram of a system structure of an online data read and write operation after data deduplication according to Embodiment 1 of the present invention.
  • FIG. 8 is a schematic structural diagram of a system for combining a global metadata pool device and a virtual LUN device to collectively manage metadata according to Embodiment 1 of the present invention
  • FIG. 9 is a schematic diagram of a correspondence relationship between a virtual LBA address space and a data segment after deduplication according to Embodiment 1 of the present invention.
  • FIG. 11 is a schematic structural diagram of a unified metadata management system according to Embodiment 1 of the present invention.
  • Detailed description
  • the segment that is, the physical data is included in the actual data mapped by the virtual LBA address segment
  • The data in the original storage location may be incomplete (some or all of it may have been merged into the corresponding data segment reference). If an I/O request arriving at a virtual LBA address on the virtual LUN is then redirected to the original LBA address space of the physical data, it obtains incomplete or invalid data.
  • The smallest data unit managed by a block-level virtualized storage device is usually the smallest data unit managed by the storage medium. This minimum data unit is called a block. For a disk, the block size is usually 512 bytes; other storage media such as tape are similar.
  • Traditional deduplication technology usually takes the byte as its minimum unit of operation when segmenting data and comparing it for duplicates (in theory it could also operate bitwise).
  • As a result, traditional deduplication cannot be applied directly to the virtualization layer of a block-level virtualized storage device.
  • Data on a block-level virtualized storage device is read and written in units of a block (512 bytes for a disk), whereas the data to be deduplicated usually has the byte as its smallest unit.
  • If deduplication is applied directly to the block-level virtualized storage device, data that was stored in one block before deduplication may be stored in at least two blocks afterward (for example, the first half of the data in a block is placed in one data segment reference and the second half in another data segment reference).
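  • The mismatch described above can be made concrete with a small sketch. It is illustrative only: the helper names `blocks_touched` and `is_block_aligned` are hypothetical and do not come from the patent, and the 512-byte block size is the disk example from the text. The sketch shows that a segment boundary chosen at byte granularity can leave one block shared by two data segments.

```python
BLOCK_SIZE = 512  # bytes per block (the disk example used in the text)


def blocks_touched(offset: int, length: int, block_size: int = BLOCK_SIZE) -> range:
    """Return the indices of the blocks covered by bytes [offset, offset + length)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be positive")
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


def is_block_aligned(offset: int, length: int, block_size: int = BLOCK_SIZE) -> bool:
    """True when a segment starts and ends exactly on block boundaries."""
    return offset % block_size == 0 and length % block_size == 0


# Two adjacent byte-granular segments whose boundary falls inside block 0.
segment_a = (0, 300)    # bytes 0..299
segment_b = (300, 724)  # bytes 300..1023

shared = set(blocks_touched(*segment_a)) & set(blocks_touched(*segment_b))
```

Here `shared` contains block 0, so a single block read would have to be served from two different data segment references. A block-aligned segment never shares a block with its neighbor.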
  • The present invention provides a method for implementing deduplication at the virtualization layer of a block-level virtualized storage device. The method obtains the correspondence between a virtual LBA address space and the data segment produced by deduplicating the corresponding physical data, then uses that correspondence and the metadata information of the data segment to obtain the actual storage location of the data corresponding to the virtual LBA address space, and completes the I/O redirection.
  • A deduplication minimum data operation unit needs to be set.
  • Because of other functions it introduces, a block-level virtualized storage device may change, to some extent, how the virtual LBA address of the data points to its actual physical storage location. In other words, the relationship may not be the direct pointing found in a typical storage virtualization device but an indirect one that requires several conversions, for example with the virtual-layer RAID provided by some block-level virtualized storage devices, or with multi-level virtualization (a system design that maps multiple virtual LUNs onto each other in order to increase the virtual address space capacity).
  • However, whatever the system design, the pointing information from a specified virtual LBA address on a specified virtual LUN to its corresponding physical data storage location can always be obtained.
  • The method and technical solution of the present invention rely mainly on the pointing information, provided by the block-level virtualized storage device, from the virtual LBA address of the data to its actual storage location, and not on how that pointing information is obtained on the virtualized storage device.
  • The differing designs of virtualized storage devices described above therefore do not affect the application of the technical solutions described in the present invention, nor its scope of protection.
  • The following embodiments are described using a typical storage virtualization system design as an example, that is, one in which the virtual LBA address of the data points directly to its corresponding physical data storage location.
  • The deduplication minimum data operation unit may be set to an integer multiple of a block, an integer multiple of a byte, or an integer multiple of a bit according to system design requirements.
  • Setting it to an integer multiple of a byte or a bit avoids wasting too much space, but it greatly increases the amount of metadata and the difficulty of metadata management.
  • Whichever deduplication minimum data operation unit is chosen, the choice only affects how the deduplication function itself is implemented (that is, how data of a specified length is divided and managed). It does not affect the scope of application of the present invention, which is implementing deduplication at the virtualization layer of a block-level virtualized storage device. Therefore, the embodiments below are illustrated with the deduplication minimum data operation unit set to the block level (that is, an integer multiple of a block).
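  • The trade-off between the size of the minimum operation unit and the amount of metadata can be sketched as follows. This is a rough illustration under stated assumptions: the function names `split_into_segments` and `metadata_entries` are hypothetical, segments are fixed-size for simplicity, and one metadata entry per segment is assumed.

```python
BLOCK_SIZE = 512  # bytes per block (the disk example used in the text)


def split_into_segments(total_bytes: int, unit_blocks: int,
                        block_size: int = BLOCK_SIZE) -> list:
    """Split a block-aligned region into (offset, length) segments.

    Every segment is an integer multiple of a block, so no block is ever
    shared between two segments. The last segment may be shorter.
    """
    if unit_blocks <= 0:
        raise ValueError("unit_blocks must be positive")
    if total_bytes % block_size != 0:
        raise ValueError("region must be a whole number of blocks")
    unit = unit_blocks * block_size
    return [(off, min(unit, total_bytes - off))
            for off in range(0, total_bytes, unit)]


def metadata_entries(total_bytes: int, unit_bytes: int) -> int:
    """Number of segments (assumed: one metadata entry each), by ceiling division."""
    return -(-total_bytes // unit_bytes)


one_mib = 1024 * 1024
entries_8_blocks = metadata_entries(one_mib, 8 * BLOCK_SIZE)  # 4096-byte unit
entries_1_byte = metadata_entries(one_mib, 1)                 # byte-level unit
```

For 1 MiB of data, a unit of 8 blocks needs 256 entries while a byte-level unit needs 1,048,576, which is the metadata growth the text warns about.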
  • The core of the method implemented by the present invention is to obtain the correspondence between the virtual LBA address space and the deduplicated data segment of the actual physical data corresponding to that address space, together with the metadata information of the deduplicated data segment.
  • This information is usually stored separately in the storage virtualization metadata and the deduplication metadata, each managed and updated by its own functional module, with no synchronization mechanism between them. For example, information about virtual LBA addresses is stored in the storage virtualization metadata, which is managed and updated by the storage virtualization module, while information about data segments is stored in the deduplication metadata, which is managed and updated by the deduplication module.
  • The first system, described in Embodiment 1, manages and updates global metadata information in a unified way and serves functions such as storage virtualization and deduplication.
  • The second system, described in Embodiment 2, synchronizes at the level of the entire system, after which the metadata serving each function is managed and updated by its own functional module.
  • The implementation details of the two systems are described separately below.
  • Embodiment 1: Unified Metadata Management System
  • As shown in FIG. 1, an embodiment of the present invention provides a unified metadata management system for implementing deduplication on a block-level virtualized storage device, where the system includes:
  • a virtual LUN device, which is the virtual storage device that the storage virtualization module provides for front-end hosts to mount and use;
  • a deduplication module configured to delete duplicate data in the actual physical data corresponding to the specified virtual LBA address space and obtain the deduplicated data segment;
  • a storage virtualization module configured to send the virtual LBA address space of an external data read/write I/O request to the global metadata management module, receive from it the storage location information of the actual physical data corresponding to that virtual LBA address space, and complete the I/O redirection;
  • a global metadata pool device configured to store the correspondence information established by the global metadata management module and the metadata information of the deduplicated data segment obtained by the deduplication module; it is a device corresponding to the virtual LUN;
  • if a post-processing deduplication policy is adopted (as in this embodiment), then for virtual LBA address space whose duplicate data has not yet been deleted, the correspondence between the virtual LBA address space and the actual physical data storage location is also saved in the global metadata pool device;
  • the global metadata pool device may be saved and maintained in the form of a file or a table in a database;
  • a global metadata management module configured to establish a correspondence between the virtual LBA address space and the deduplicated data segment, create and initialize the global metadata pool device, manage and update metadata in the global metadata pool device, obtain the storage location information of the actual physical data corresponding to a received virtual LBA address space according to that address space, the correspondence, and the metadata information of the deduplicated data segment, and send the storage location information; if a post-processing deduplication policy is adopted (as in this embodiment), the actual physical data corresponding to the virtual LBA address space of an external I/O request may not yet have been deduplicated, in which case the global metadata management module directly returns the actual physical data storage location information recorded for that virtual LBA address space in the global metadata pool device;
  • a physical LUN device, which stores the actual physical data. It is usually a logical storage unit carved out of a large storage medium (such as a disk array) in the physical storage device layer and is identified by a logical unit number (LUN).
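  • The lookup performed by the global metadata management module can be sketched as below. This is a minimal illustration, not the patent's implementation: the class `GlobalMetadataPool`, its three dictionaries, and the function `resolve` are hypothetical names, and one metadata entry per virtual block is assumed. It shows the two cases in the text: a virtual LBA whose data has been deduplicated is resolved through its data segment, and one that has not yet been deduplicated falls back to the directly recorded physical location.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

PhysicalLocation = Tuple[str, int]  # (physical LUN id, physical LBA)


@dataclass
class GlobalMetadataPool:
    # virtual LBA -> physical location, for data not yet deduplicated
    direct_map: Dict[int, PhysicalLocation] = field(default_factory=dict)
    # virtual LBA -> (data segment id, block index inside that segment)
    lba_to_segment: Dict[int, Tuple[str, int]] = field(default_factory=dict)
    # data segment id -> location of the first block of its segment reference
    segment_ref: Dict[str, PhysicalLocation] = field(default_factory=dict)


def resolve(pool: GlobalMetadataPool, virtual_lba: int) -> PhysicalLocation:
    """Return where the data for virtual_lba actually lives (I/O redirection target)."""
    if virtual_lba in pool.lba_to_segment:
        # Deduplicated: follow the data segment to its segment reference.
        segment_id, index = pool.lba_to_segment[virtual_lba]
        lun, start_lba = pool.segment_ref[segment_id]
        return (lun, start_lba + index)
    # Not yet deduplicated (post-processing policy): use the direct mapping.
    return pool.direct_map[virtual_lba]


pool = GlobalMetadataPool(
    direct_map={10: ("lun0", 100), 11: ("lun0", 101)},
    lba_to_segment={11: ("seg-a", 1)},
    segment_ref={"seg-a": ("lun1", 500)},
)
```

With this pool, virtual LBA 10 still resolves to its original location, while virtual LBA 11 is redirected into the segment reference on `lun1`. Checking the segment mapping first means a stale direct entry is never used once a block has been deduplicated.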
  • As shown in FIG. 3, the deduplication module includes:
  • The deduplication minimum data operation unit may be set to an integer multiple of a block, an integer multiple of a bit, or an integer multiple of a byte.
  • an obtaining unit configured to acquire the actual physical data storage location information corresponding to the specified virtual LBA address space;
  • an extracting unit configured to extract, from the physical LUN device, data of a specified length for deduplication, according to the actual physical data storage location information acquired by the obtaining unit and the deduplication minimum data operation unit set by the setting unit;
  • a dividing unit configured to divide, according to the deduplication policy set by the setting unit, the specified-length data extracted by the extracting unit into data segments of a specified size according to the deduplication minimum data operation unit set by the setting unit;
  • a data fingerprint library unit configured to store data fingerprints; during deduplication, each newly computed data fingerprint is compared with the data fingerprints in the library to implement the deduplication function;
  • a deduplication unit configured to calculate the data fingerprint of each data segment of the specified size divided by the dividing unit, compare it with the data fingerprints stored by the data fingerprint library unit, and transmit the comparison result;
  • a metadata management and update unit configured to receive the comparison result and, when the comparison shows that the data fingerprints are the same, send the content and request of the metadata update to the global metadata management module; the global metadata management module then updates the metadata of each deduplicated data segment, taking into account the data reads and writes that occur during the deduplication process.
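  • The interaction between the dividing unit, the fingerprint library, and the deduplication unit can be sketched as follows. Several things here are assumptions rather than details from the patent: SHA-256 is used as the fingerprint function, segments are fixed-size (the embodiment itself uses variable-length segmentation), and the fingerprint library is a plain dictionary. The function name `deduplicate` is hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def deduplicate(data: bytes, segment_size: int,
                fingerprint_library: Dict[str, int]) -> List[Tuple[int, str, bool]]:
    """Divide data into segments, fingerprint each one, and flag duplicates.

    fingerprint_library maps a fingerprint to the offset of the retained copy
    and is updated in place. Returns (offset, fingerprint, is_duplicate) per
    segment; a caller would turn the duplicates into metadata update requests.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    results = []
    for offset in range(0, len(data), segment_size):
        segment = data[offset:offset + segment_size]
        fingerprint = hashlib.sha256(segment).hexdigest()  # assumed hash choice
        is_duplicate = fingerprint in fingerprint_library
        if not is_duplicate:
            # First occurrence: this copy becomes the retained reference.
            fingerprint_library[fingerprint] = offset
        results.append((offset, fingerprint, is_duplicate))
    return results


library: Dict[str, int] = {}
sample = b"A" * 1024 + b"B" * 512 + b"A" * 512
report = deduplicate(sample, 512, library)
```

In this sample the four 512-byte segments are A, A, B, A, so the second and fourth are flagged as duplicates and the library ends up holding two fingerprints. A production system might also compare segment contents byte by byte before discarding a copy; that step is omitted here.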
  • The functions of the global metadata management module include: 1) coordinating conflicts between the data read/write process and the deduplication process while data is being read and written (for example, when the actual data pointed to by a virtual LBA address is requested by the data read/write process and the deduplication process at the same time); 2) interacting with the deduplication module, updating the metadata information of the deduplicated data segments in the global metadata pool device, and ensuring the validity and consistency of the metadata information corresponding to each virtual LBA address.
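  • The patent does not say how the conflict between the read/write process and the deduplication process is coordinated. One possible approach, shown purely as an assumption, is a per-address lock table: whichever process touches a virtual LBA first holds its lock until it is done. The class `LbaLockTable` and its methods are hypothetical names.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class LbaLockTable:
    """One lock per virtual LBA, shared by the I/O path and the deduplication path."""

    def __init__(self) -> None:
        self._guard = threading.Lock()              # protects the table itself
        self._locks = defaultdict(threading.Lock)   # virtual LBA -> lock

    @contextmanager
    def hold(self, virtual_lba: int):
        """Block until the address is free, then hold it for the with-block."""
        with self._guard:
            lock = self._locks[virtual_lba]
        with lock:
            yield

    def is_busy(self, virtual_lba: int) -> bool:
        """True if some process currently holds the address."""
        with self._guard:
            lock = self._locks.get(virtual_lba)
        return lock is not None and lock.locked()


table = LbaLockTable()
```

A deduplication pass would wrap its metadata update for an address in `with table.hold(lba):`, and the read/write path would do the same around its redirection lookup, so neither sees a half-updated mapping. Locking per address rather than globally keeps unrelated I/O from waiting on deduplication.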
  • The global metadata pool device and the global metadata management module uniformly store and manage the metadata corresponding to all functions of the entire system.
  • Depending on where the global metadata pool device is located in the system, the whole system can have several topologies, as shown in FIG. 11 and FIG. 8:
  • a metadata storage device (that is, the global metadata pool device);
  • the global metadata pool device is merged with the virtual LUN device.
  • FIG. 11 is taken as an example to describe the implementation details of the entire system.
  • The global metadata pool device is uniformly managed and maintained by the global metadata management module. It stores all the metadata of the entire system and serves all of the system's functions.
  • The storage virtualization and deduplication functions are taken as examples here; other functions, such as RAID, are implemented in a similar way and are not described. In other topologies there are likewise modules and mechanisms, similar in function to the global metadata management module, that maintain and manage the metadata. Because the implementation is similar, they are not discussed here.
  • Virtualization of block-level virtualized storage devices can be implemented in several ways. Typical are the in-band architecture, whose main commercial products include IBM SAN Volume Controller (SVC), the IBM DS8000 series, the Hitachi VSP series, EMC VPLEX, and DataCore SANsymphony-V, and the out-of-band architecture, whose main commercial product is EMC Invista.
  • The core idea is to create a virtual LUN for the front-end host to mount and use, and to map the virtual LBA address space on the virtual LUN to the physical location where the real data is stored, thereby redirecting the data read/write I/O that reaches the virtual LUN.
  • Because the method of the present invention depends mainly on the virtual LUN of the virtualization layer and its metadata, the differences between the foregoing implementations (such as whether the data path and the control path are separated) do not affect the scope of the invention. To simplify the description, the embodiments of the present invention use the virtualization implementation of an in-band block-level virtualized storage device as an example.
  • Deduplication technology also has multiple implementations; typical ones are fixed-length dedup, variable-length dedup, and hybrid-length dedup.
  • The core idea is to divide data of a specified length into data segments of the required size according to a predetermined algorithm, compute and compare the fingerprints of the data segments, remove the duplicate data, and retain a single data segment reference. Through the metadata of each data segment, all read/write I/O that reaches the data of a given data segment is redirected. Because the different implementations of deduplication only affect deduplication performance and effect, and not the feasibility of the present invention, they do not affect the deduplication solution described above.
  • The embodiments use variable-length deduplication as an example; fixed-length deduplication can be regarded as a special case of variable-length deduplication.
  • Depending on when deduplication is performed, deduplication schemes can be divided into in-line (real-time) deduplication (in-line dedup) and post-processing deduplication (post-processing dedup).
  • The technical innovation of the present invention lies in applying a deduplication solution at the virtualization layer of a block-level virtualized storage device, not in how deduplication itself is performed; moreover, deduplication technology has matured and is in large-scale commercial use. Therefore, implementation details of deduplication, such as the data segmentation algorithm and the calculation and comparison of data fingerprints, are omitted from the embodiments of the present invention and are not explained in depth.
  • the deduplication feature is the basis for discussion.
  • Block: the smallest unit of data managed by a storage medium.
  • A block is a sequence of bytes or bits, usually of fixed length. On a disk the size is usually 512 bytes; other storage media such as tape are similar.
  • Data extent (data segment): a concept used to describe the deduplication function. Before deleting duplicate data, the deduplication module divides data of a specified length, according to a predetermined algorithm (the segmentation method differs between deduplication schemes), into a number of data segments of the required size; the fingerprints of these data segments are then calculated and compared, and the duplicate data is deleted.
  • A data segment is a logical concept. Through its data segment metadata it points to the actual physical data stored in its corresponding data segment reference.
  • Data extent reference: a concept used to describe the deduplication function. After deduplication, only one copy of the physical data is kept on a specified storage medium for data segments with identical content, and a reference relationship is established from those data segments to that unique physical copy. The unique physical copy referenced by several data segments is called the data segment reference of those data segments.
  • Data extent metadata: a concept used to describe the deduplication function. It is the reference information (also called pointing or pointer information) stored during deduplication that links a data segment to the address of its data segment reference. It also includes the actual location where the data segment reference is stored (such as the physical device on which the LUN resides and the corresponding LBA address on the LUN). After deduplication, all I/O that arrives at a data segment is redirected to its data segment reference based on this metadata.
  • Virtual LBA address metadata: serves the data-access I/O redirection function of storage virtualization; it is used to redirect from a specified virtual LBA address to the actual data storage location.
  • Depending on the design needs of the system, this metadata can contain different information. If software RAID or multi-level virtualization is implemented at the virtualization layer, the metadata contains whatever is needed to redirect a virtual LBA address to the actual data storage location with those functions in place.
  • In this embodiment the metadata includes the following: whether the actual data corresponding to the specified virtual LBA address has been deduplicated; if it has, the corresponding data segment and the offset relative to the head of that data segment; if it has not, the pointing information from the virtual LBA address to the actual data storage location.
  • Virtual LUN metadata: the collection of virtual LBA address metadata contained in a virtual LUN. In practice it can be saved and maintained as a file or as a table in a database.
  • Storage virtualization metadata: mainly comprises at least one set of virtual LUN metadata, plus information that supports other features of the virtual LUN, such as RAID.
  • Data dedup metadata: mainly comprises the data segment metadata and the information needed to maintain it (such as the space planning and deployment of metadata storage).
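The metadata definitions above can be made concrete with a small sketch. The record layout below is only an illustration under stated assumptions: the class and field names (`PhysicalLocation`, `VirtualLbaMetadata`, `ExtentMetadata`, `offset_blocks`, and so on) are hypothetical and do not come from the patent, which leaves the storage format open (a file or a database table).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhysicalLocation:
    """Where a block actually lives: a physical LUN plus an actual LBA."""
    lun: str   # hypothetical LUN identifier, e.g. "LUN A"
    lba: int   # actual LBA address on that LUN


@dataclass
class VirtualLbaMetadata:
    """Per-virtual-LBA record used for I/O redirection (illustrative layout)."""
    deduplicated: bool = False
    # Used while the data has NOT been deduplicated:
    location: Optional[PhysicalLocation] = None
    # Used once the data HAS been deduplicated:
    extent_id: Optional[str] = None      # the data segment this address belongs to
    offset_blocks: Optional[int] = None  # offset from the segment head, in blocks


@dataclass
class ExtentMetadata:
    """Per-data-segment record pointing at its data segment reference."""
    reference_start: PhysicalLocation  # first block of the stored reference
    length_blocks: int                 # segment length, in whole blocks


# One address that still points straight at physical storage...
plain = VirtualLbaMetadata(location=PhysicalLocation("LUN A", 4096))
# ...and one that has been folded into data segment "c2".
shared = VirtualLbaMetadata(deduplicated=True, extent_id="c2", offset_blocks=1)
extent_c2 = ExtentMetadata(PhysicalLocation("Dedup LUN", 100), length_blocks=3)
```

Keeping both variants in one record mirrors the description: a single flag decides whether the address resolves directly or through a data segment.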
  • an embodiment of the present invention provides a method for implementing deduplication on a block-level virtualized storage device, including the following steps:
  • Step 101: At the virtualization layer of the block-level virtualized storage device, deploy the deduplication module and the global metadata management module, and create and initialize the global metadata pool device for the specified virtual LUN. A deduplication scheme is selected according to actual system requirements, such as performance, functionality, and the target deduplication ratio, and the corresponding deduplication module is then deployed according to the selected scheme. As described above, this embodiment selects the currently mainstream variable-length, post-processing deduplication scheme;
  • A corresponding deduplication policy must also be formulated, including the start time of the deduplication engine (for example at night, when data is read and written infrequently) and the time and cycle of reclamation after deduplication. Policy formulation is often tied to the functional design of the deduplication module, and different deduplication schemes may lead to different policies;
  • After the deduplication module is deployed, the global metadata management module is deployed. The global metadata management module then creates a corresponding global metadata pool for the specified virtual LUN.
  • Each virtual LUN can have its own exclusive global metadata pool, or share one with other virtual LUNs. This embodiment describes only the case in which an exclusive global metadata pool is created for each virtual LUN;
  • The global metadata management module then needs to initialize the pool.
  • The specific steps are as follows: 1) Create a global metadata pool, the Dedup vLUN, for a given virtual LUN. Through the storage virtualization module, the global metadata management module obtains the virtual LBA address space of the virtual LUN and the pointing information from that virtual LBA address space to the allocated actual LBA address space, and copies it to the corresponding Dedup vLUN. In other words, for every virtual LBA address on the virtual LUN, the same virtual LBA address and the pointer to its actual physical data storage location can be found in the Dedup vLUN. If the actual LBA address space of the virtual LUN is allocated dynamically (for example, when thin provisioning is used), the information is copied to the Dedup vLUN after allocation; 2) In the initial state, the actual physical data corresponding to the virtual LBA addresses in the global metadata pool has not been deduplicated and is marked with the "not deduplicated" status flag.
  • The storage virtualization module needs to transmit the virtual LBA address to the global metadata management module.
  • The global metadata management module returns the location information of the actual physical data storage to the storage virtualization module, and the storage virtualization module completes the I/O redirection;
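The initialization of the global metadata pool can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary layout and the names `init_metadata_pool` and `lookup` are assumptions, and physical locations are modeled simply as `(LUN name, actual LBA)` tuples.

```python
def init_metadata_pool(virtual_to_physical):
    """Build a Dedup vLUN style pool from the virtual LUN's existing mapping.

    virtual_to_physical maps each virtual LBA to a (lun, actual_lba) tuple.
    Every entry is copied and flagged "not deduplicated" (initial state).
    """
    pool = {}
    for vlba, location in virtual_to_physical.items():
        pool[vlba] = {"deduplicated": False, "location": location}
    return pool


def lookup(pool, vlba):
    """Return the physical location for a virtual LBA that is not deduplicated."""
    return pool[vlba]["location"]


# Hypothetical mapping held by the storage virtualization module.
virtual_lun_map = {0: ("LUN A", 10), 1: ("LUN B", 3), 2: ("LUN C", 42)}
dedup_vlun = init_metadata_pool(virtual_lun_map)
```

With thin provisioning, the same copy step would simply run again for each newly allocated address range.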
  • Figure 4 shows the system structure before the deduplication function module is deployed.
  • Storage virtualization maps the virtual LBA addresses on the virtual LUN to the actual LBA addresses on the actual LUNs (LUN A, LUN B, and LUN C in Figure 4), completing the redirection of I/O requests sent by the host.
  • Figure 5 is a schematic diagram of the system after the deduplication function module is deployed but before any duplicate data has been deleted.
  • The Dedup vLUN is the global metadata pool corresponding to the virtual LUN.
  • Through the global metadata management module, the virtual LBA address space of the virtual LUN corresponds to the virtual LBA address space of the Dedup vLUN, and the Dedup vLUN also stores the storage location information of the actual physical data for that virtual LBA address space.
  • the data segment after the physical data is deduplicated;
  • the virtual LBA address space in the embodiment of the present invention is a virtual LBA address segment, and includes a plurality of consecutive or discontinuous virtual LBA addresses;
  • The setting unit sets the minimum deduplication data operation unit to the block level, consistent with the smallest data unit of the storage medium.
  • Deleting the duplicate data in the actual physical data corresponding to the specified virtual LBA address space, and obtaining the deduplicated data segments of the physical data, comprises the following sub-steps: 1) the acquisition unit in the deduplication module, after interacting with the global metadata management module, obtains a specified virtual LBA address space that has not been deduplicated together with its actual physical data storage location information; 2) according to the storage location information obtained by the acquisition unit, the extracting unit in the deduplication module extracts, from the physical location specified by that information and along block boundaries, data of a specified length for deduplication; that is, the extracted data starts on a block boundary; 3) the dividing unit in the deduplication module divides the extracted data, in units of blocks, into data segments of a specified size (each resulting data segment consists of at least one complete block); 4) the deduplication unit in the deduplication module calculates the data fingerprint of each data segment, compares it with the fingerprints stored in the data fingerprint library unit, and removes duplicates, obtaining the deduplicated data segments for the physical data of the specified virtual LBA address space;
  • In sub-step 1), the global metadata management module selects, based on the stored information on whether each virtual LBA address has been deduplicated and on the I/O request activity of the storage virtualization module, virtual LBA addresses that are not occupied by data reads and writes, and sends them to the deduplication module for deduplication;
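Sub-steps 3) and 4) above can be sketched as follows. The sketch makes several assumptions not stated in the patent: it uses fixed-length segments (the special case of variable-length deduplication), SHA-256 as the fingerprint function, and an in-memory dictionary as the fingerprint library; the names `split_into_extents` and `deduplicate` are illustrative. A production system would also guard against fingerprint collisions.

```python
import hashlib

BLOCK_SIZE = 512  # typical disk block size, in bytes


def split_into_extents(data: bytes, blocks_per_extent: int):
    """Cut block-aligned data into data segments made of whole blocks."""
    if len(data) % BLOCK_SIZE != 0:
        raise ValueError("extracted data must consist of whole blocks")
    step = BLOCK_SIZE * blocks_per_extent
    return [data[i:i + step] for i in range(0, len(data), step)]


def deduplicate(extents, fingerprint_index):
    """Map each data segment to a reference id, reusing ids for equal content.

    fingerprint_index maps fingerprint -> reference id and is updated in place.
    """
    reference_ids = []
    for extent in extents:
        fingerprint = hashlib.sha256(extent).hexdigest()
        if fingerprint not in fingerprint_index:
            # First time this content is seen: it becomes a new reference.
            fingerprint_index[fingerprint] = len(fingerprint_index)
        reference_ids.append(fingerprint_index[fingerprint])
    return reference_ids


# Three two-block segments; the first and third have identical content.
data = b"A" * 1024 + b"B" * 1024 + b"A" * 1024
index = {}
extents = split_into_extents(data, blocks_per_extent=2)
refs = deduplicate(extents, index)
```

Here the three segments resolve to only two stored references, the same shape as data segments DE1, DE2, and DE3 sharing two data segment references in the description.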
  • Step 103: Update the metadata of the deduplicated data segments, establish the correspondence between the virtual LBA address space and the deduplicated data segments, and update the metadata of the virtual LBA addresses contained in the virtual LBA address space.
  • The metadata management and update unit in the deduplication module sends the content and request of the metadata update to the global metadata management module, which, taking into account current data reads and writes and the information produced during deduplication, updates the metadata of each deduplicated data segment;
  • The global metadata management module establishes the correspondence between the deduplicated virtual LBA address space and the deduplicated data segments of the corresponding physical data. As shown in FIG. 9, the virtual LBA address space of the data corresponds to the actual LBA address space on the physical LUN.
  • After deduplication, the data segments DE1, DE2, and DE3 are obtained, which point to the data segment references DI1, DI2, and DI1 respectively;
  • From the pointing and correspondence between the virtual LBA address space and the actual LBA address space of the same data before deduplication, the correspondence between each virtual LBA address and each of the data segments DE1, DE2, and DE3 can be obtained (because the minimum deduplication data operation unit here is the block, consistent with the smallest data management unit of the storage medium). The double arrows express this correspondence; for example, DE2 corresponds to c2;
  • The metadata includes the actual physical data storage location information corresponding to the virtual LBA address;
  • The initiation and execution of physical space reclamation may be handled differently in different system designs.
  • For example, management of the entire physical space may be the responsibility of the storage virtualization module, in which case reclamation can also be initiated by that module and carried out by the deduplication module;
  • FIG. 5 is a schematic diagram of a system that has not deleted duplicate data after deploying the deduplication function module
  • Figure 6 is a schematic diagram of a system in which some data has been deduplicated after the deduplication module is deployed.
  • the length of each data segment may be different for the variable length deduplication technology.
  • this embodiment creates a physical LUN device named "Dedup LUN" on the storage medium for storage.
  • The minimum deduplication data operation unit has been set to the block level, so the length of each data segment is an integer multiple of the storage medium block length, and the data segment reference corresponding to each data segment consists of a number of complete blocks.
  • The Dedup vLUN needs to save the metadata information of each virtual LBA address and the metadata information of the deduplicated data segments;
  • Step 104: For a data read/write I/O request that reaches a virtual LBA address space on the virtual LUN, obtain the storage location information of the actual physical data according to the stored correspondence between the virtual LBA address space and the deduplicated data segments and the metadata information of the data segments, and complete the redirection of the read/write I/O of the virtualized storage device;
  • This step is designed mainly around the redirection of data I/O after deduplication, which is also the core problem the present invention sets out to solve.
  • I/O redirection must also be performed when the actual physical data corresponding to the virtual LBA address accessed by external data I/O has not yet been deduplicated, for example under a post-processing deduplication policy such as the one used in this embodiment.
  • The pointing information from a virtual LBA address to the actual physical data storage location is stored in advance in the virtual LBA address metadata.
  • Whether the actual physical data corresponding to a specified virtual LBA address has been deduplicated is likewise recorded in, and available from, the virtual LBA address metadata. When an external data access I/O request arrives at a specified virtual LBA address, the storage virtualization module sends the virtual LBA address to the global metadata management module, which determines from the metadata of that address whether the corresponding actual physical data has been deduplicated.
  • If it has not been deduplicated, the actual physical data storage location information corresponding to the virtual LBA address is returned to the storage virtualization module. If it has been deduplicated, then from the metadata of the virtual LBA address (the corresponding data segment and the offset relative to the segment head) and the metadata of that data segment (including the actual storage location of the corresponding data segment reference), the storage location of the actual physical data is obtained by the following calculation (refer to FIG. 6) and returned to the storage virtualization module:
  • Suppose the virtual LBA address vLa of a host read/write I/O request corresponds, in the Dedup vLUN, to physical data that has already been deduplicated, and that it maps to the position at offset rLa from the head of the deduplicated data segment $c_k$.
  • Because the minimum deduplication data operation unit is the block, rLa is the relative LBA address length of the position of vLa in $c_k$, measured from the segment head. The actual data storage location pLa corresponding to vLa is an actual LBA address within the data segment reference corresponding to $c_k$, and is given by formula (1):
  • $$pLa = pLa_{0}(c_k) + rLa \qquad (1)$$
  • Here $pLa_{0}(c_k)$ denotes the starting LBA address of the physical location where the data segment reference corresponding to $c_k$ is stored. It is known information, stored in the data segment metadata after deduplication; rLa is likewise known information, stored in the virtual LBA address metadata during deduplication. The calculation therefore yields the actual data storage location pLa corresponding to the virtual LBA address vLa;
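The address translation described for Step 104, including formula (1), can be sketched as below. The dictionary layouts and the function name `resolve` are assumptions for illustration; locations are modeled as `(LUN name, actual LBA)` tuples and offsets are counted in blocks.

```python
def resolve(vlba, lba_meta, extent_meta):
    """Translate a virtual LBA into the location of its actual physical data.

    lba_meta:    virtual LBA -> metadata record (see the two shapes below)
    extent_meta: data segment id -> (lun, starting LBA of its segment reference)
    """
    entry = lba_meta[vlba]
    if not entry["deduplicated"]:
        # Not deduplicated: the metadata points straight at the data.
        return entry["location"]
    # Deduplicated: formula (1), start of the segment reference plus rLa.
    lun, reference_start = extent_meta[entry["extent"]]
    return (lun, reference_start + entry["offset"])


lba_meta = {
    10: {"deduplicated": False, "location": ("LUN A", 7)},
    11: {"deduplicated": True, "extent": "c2", "offset": 3},  # rLa = 3 blocks
}
extent_meta = {"c2": ("Dedup LUN", 100)}  # reference for c2 starts at LBA 100

direct = resolve(10, lba_meta, extent_meta)
redirected = resolve(11, lba_meta, extent_meta)
```

Because the offset is expressed in whole blocks, the result is always a valid block address, which is why the description fixes the minimum deduplication unit at the block level.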
  • With this information, the storage virtualization module can complete the read/write I/O redirection for the virtual LUN and the actual reading and writing of the data, in the following cases:
  • The metadata of all virtual LBA addresses already contains the storage location information of the corresponding physical data.
  • A data read differs from a read before deduplication, as shown in Figure 7. Suppose an external read I/O request is dispatched to virtual LBA addresses on the virtual LUN (i.e., to access the physical data mapped by b1 to bn). The storage virtualization module sends the read request for these virtual LBA addresses to the global metadata management module, which finds that they correspond to part of the data of the deduplicated data segments c2 to c6 (i.e., the blocks from the second block of c2 to the second block of c6). After the virtual LBA address conversion described above, the corresponding actual data storage LBA addresses (which may not be contiguous) are returned to the storage virtualization module, which reads the data from the specified physical locations and returns it to the external read I/O request;
  • A data write differs from a write before deduplication, as shown in Figure 7. Suppose an external write I/O request is dispatched to virtual LBA addresses on the virtual LUN (i.e., to access the physical data mapped by b1 to bn). The storage virtualization module sends the write request for this virtual LBA address segment to the global metadata management module, which finds that it corresponds to part of the data of the deduplicated data segments c2 to c6 (i.e., the blocks from the second block of c2 to the second block of c6); then:
  • First, the global metadata management module, through the storage virtualization module, allocates new storage space on the back-end storage medium for this write I/O, returns the location of the new storage space to the storage virtualization module, and saves it. The storage virtualization module in turn redirects the external write I/O to the newly allocated storage location and writes the data;
  • Second, the global metadata management module, through the storage virtualization module, allocates new storage space on the back-end storage medium, and the deduplication module copies the actual data, held in the data segment references, of the blocks in those data segments that are not affected by this write I/O (i.e., the first block of c2 and the third block of c6) to the newly allocated storage location and saves it;
  • Third, the global metadata management module updates the metadata of the virtual LBA address segment corresponding to data segments c2 to c6 in the global metadata pool: (1) the metadata of the virtual LBA address segment on the Dedup vLUN affected by this write I/O is updated to point to the new data storage location allocated in the first step; (2) the metadata of the virtual LBA address segment on the Dedup vLUN that belongs to the data segments touched by this write I/O but is not itself affected, that is, the virtual LBA addresses corresponding to the first block of c2 and the third block of c6, is updated so that the pointer to the actual data storage location refers to the copies made in the second step; (3) the virtual LBA addresses on the Dedup vLUN corresponding to data segments c2 to c6 (a range larger than the virtual LBA addresses affected by this write I/O) are marked "not deduplicated".
  • The deduplication module then deduplicates them according to the predetermined deduplication policy;
  • After that deduplication, the metadata of the virtual LBA addresses involved is updated: the pointing information to the actual data location, the flag showing that the corresponding actual data has been deduplicated, and the corresponding data segment and the offset relative to the segment head;
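The write handling just described can be sketched as a redirect-on-write routine. This is a simplified illustration under stated assumptions: the function name `handle_write`, the `allocate` callback, and the dictionary layouts are hypothetical; the actual data transfer is not performed, only the metadata changes and the list of block copies that would be needed; and writes to addresses that are not deduplicated are left to the normal write path.

```python
def handle_write(write_vlbas, lba_meta, extent_meta, allocate):
    """Update metadata for a write that touches deduplicated virtual LBAs.

    Returns the (source, destination) copies needed for blocks that share a
    data segment with the written blocks but are not themselves overwritten.
    """
    affected = {lba_meta[v]["extent"] for v in write_vlbas
                if lba_meta[v]["deduplicated"]}
    copies = []
    for vlba, entry in list(lba_meta.items()):
        if not entry["deduplicated"] or entry["extent"] not in affected:
            continue  # belongs to a data segment this write does not touch
        new_location = allocate()
        if vlba not in write_vlbas:
            # Unaffected block of an affected segment: preserve its old data.
            lun, start = extent_meta[entry["extent"]]
            copies.append(((lun, start + entry["offset"]), new_location))
        # Either way the address now points at private, not-yet-deduplicated
        # storage and will be picked up again by the deduplication policy.
        lba_meta[vlba] = {"deduplicated": False, "location": new_location}
    return copies


# Data segment c2 covers virtual LBAs 0..2; the write touches only 1 and 2.
lba_meta = {v: {"deduplicated": True, "extent": "c2", "offset": v}
            for v in range(3)}
extent_meta = {"c2": ("Dedup LUN", 100)}
counter = iter(range(1000))
copies = handle_write({1, 2}, lba_meta, extent_meta,
                      allocate=lambda: ("LUN N", next(counter)))
```

The shared data segment reference is never modified in place, so other data segments that point to the same reference keep reading consistent data.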
  • Embodiment 2: Metadata Partitioned Management System
  • This embodiment no longer uses a global metadata pool device, as in Embodiment 1, to store and manage the metadata of the entire system in one place. Instead,
  • the virtual LBA address metadata and the metadata of the deduplicated data segments are managed and updated by the storage virtualization module and the deduplication module respectively, as shown in the figure.
  • The contents of these two sets of metadata are basically the same as in Embodiment 1.
  • The global metadata management module does not play the same role in the system as in Embodiment 1: it is no longer responsible for initializing a global metadata pool device or for unified metadata management and updating. Instead it focuses on synchronizing, coordinating, and mediating the metadata updates of the storage virtualization and deduplication modules.
  • an embodiment of the present invention further provides a metadata partitioning management system for implementing deduplication on a block-level virtualized storage device, where the system includes:
  • a virtual LUN device for mounting and using the front-end host.
  • a storage virtualization metadata pool device configured to store metadata information corresponding to the virtual LBA address space
  • a data deduplication metadata pool device configured to store metadata information of the data segment deduplicated by the deduplication module
  • the deduplication module is configured to delete duplicate data in the actual physical data corresponding to the specified virtual LBA address space, obtain the deduplicated data segment, and update the metadata information in the deduplication metadata pool device;
  • the global metadata management module is configured to establish a correspondence relationship between the virtual LBA address space and the deduplicated data segment, and synchronously coordinate the update and interaction of the storage virtualization module and the deduplication module metadata;
  • the storage virtualization module is configured to obtain the physical data corresponding to the virtual LBA address space pointed by the external data read/write request according to the correspondence established by the global metadata management module and the metadata information of the data segment deduplicated by the deduplication module. Storage location information, complete I/O redirection, and update metadata information in the storage virtualization metadata pool device;
  • a physical LUN device that stores actual physical data.
  • the deduplication module includes:
  • a setting unit configured to set a deduplication policy and a deduplication minimum data operation unit;
  • the deduplication minimum data operation unit is an integer multiple of a block, an integer multiple of a bit, or an integer multiple of a byte;
  • An obtaining unit configured to obtain actual physical data storage location information corresponding to the specified virtual LBA address space
  • An extracting unit, configured to extract from the physical LUN device, according to the actual physical data storage location information acquired from the obtaining unit and the minimum deduplication data operation unit set by the setting unit, data of a specified length for deduplication;
  • A dividing unit, configured to divide the data extracted by the extracting unit, according to the deduplication policy and the minimum deduplication data operation unit set by the setting unit, into data segments of a specified size;
  • A data fingerprint library unit, configured to store data fingerprints; during deduplication, newly calculated data fingerprints are compared with those in the data fingerprint library to implement the deduplication function;
  • A deduplication unit, configured to calculate the data fingerprints of the data segments produced by the dividing unit, compare them with the data fingerprints stored in the data fingerprint library unit, and send the comparison result;
  • A metadata management and update unit, configured to receive the comparison result and, when the comparison shows identical data fingerprints, update the metadata of the deduplicated data segments under the coordination of the global metadata management module and send it to the deduplication metadata pool device.
  • The virtual LBA address metadata and the deduplicated data segment metadata are no longer stored in a global metadata pool device; they are stored separately in the storage virtualization metadata pool device and the deduplication metadata pool device. Metadata updates are not performed by the global metadata management module but by the storage virtualization module and the deduplication module respectively. The metadata content, and the synchronization and coordination mechanism of the global metadata management module during metadata updates, are basically the same as in Embodiment 1.
  • A request for the metadata of a specified virtual LBA address is served by the storage virtualization module, which obtains it from the storage virtualization metadata pool device after interacting with the global metadata management module. To obtain the metadata of a specified data segment, the storage virtualization module sends a request, based on the virtual LBA address metadata, to the global metadata management module; the global metadata management module interacts with the deduplication module, which obtains the metadata from the deduplication metadata pool device and returns it, through the global metadata management module, to the storage virtualization module. The metadata content required in this process is similar to that of Embodiment 1.
  • the method for implementing deduplication on the block-level virtualized storage device provided in this embodiment differs from the first embodiment in the following manner:
  • Step 101 deploying a deduplication module and a global metadata management module at a virtualization layer of the virtualization device at the block level;
  • the global metadata management module does not need to create and initialize a global metadata pool device in this step; other implementation details of this embodiment are basically the same as those in the first embodiment, and details are not described herein again.
  • The technical solution provided by the embodiments of the present invention can delete duplicate data across hosts and storage devices, achieving deduplication over a wider scope.
  • The technical solution provided by the embodiments of the present invention does not occupy host system resources, ensuring that the service programs running on the host can run smoothly.
  • The technical solution provided by the embodiments of the present invention can centrally manage and protect the metadata of the deduplication function, simplifying the design and implementation of the entire system.

Abstract

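A method and system for implementing deduplication on a block-level virtualized storage device, belonging to the field of data storage technology. The method comprises: deleting duplicate data in the actual physical data corresponding to a specified virtual LBA address space to obtain the deduplicated data segments of the physical data; establishing a correspondence between the virtual LBA address space and the deduplicated data segments; and, according to the correspondence and the metadata information of the data segments, obtaining the storage location information of the actual physical data corresponding to the virtual LBA address space targeted by an external data read/write request, thereby completing I/O redirection. The invention also provides a system for implementing deduplication on a block-level virtualized storage device. The invention can delete duplicate data across hosts and storage devices, achieving deduplication over a wider scope.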

Description

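Method and System for Implementing Deduplication on a Block-Level Virtualized Storage Device

Technical Field
The present invention relates to the field of data storage technology, and in particular to a method and system for implementing deduplication on a block-level virtualized storage device.

Background
With global data volume doubling on average every 18 to 24 months, and with legal requirements forcing enterprises to retain data for much longer periods, deduplication technology is of great significance. It is one of the important means by which enterprises reduce storage costs, and in turn IT costs, and stay competitive. Deduplication on traditional block-level storage devices is already mature and in large-scale commercial use.
However, with the introduction of storage virtualization, the overall architecture of storage systems has changed considerably. The main change is that a virtualized storage system adds a virtualization layer to the traditional storage architecture, forming a three-layer architecture consisting of a host layer, a virtualization layer, and a physical storage device layer (such as JBOD or disk arrays). The host layer and the physical storage device layer are exactly the same as in a traditional storage system; the virtualization layer is a software layer (or a software function module embedded in hardware). A unified storage device pool is built inside the virtualization layer. By constructing the correspondence between physical LUNs (Logical Unit Number) and virtual LUNs, virtual LUNs are presented to front-end hosts for mounting and use. This removes the differences between heterogeneous storage devices, allows all storage resources to be managed through a unified interface, and greatly simplifies storage management and reduces the cost of use. Together with features such as thin provisioning and non-disruptive data migration, it greatly improves the utilization of storage devices.
As storage virtualization has come into wider use, traditional deduplication solutions have shown shortcomings in practice, specifically in the following respects:
1. Implementing deduplication at the host layer requires the user to deploy deduplication software on every host connected to the virtualized storage device, which then deletes the duplicate data on that host. This approach has the following limitations: (1) the deduplication scope is limited to each host on which the software is installed and the data that host manages, so duplicate data cannot be deleted across hosts; (2) deduplication software must be installed on every host, and the fingerprint calculation and comparison it performs consume considerable resources and affect host performance.
2. Implementing deduplication at the physical storage device layer requires that, with the storage virtualization layer as intermediary, all or some of the connected storage devices themselves have a deduplication function. This approach has the following limitations: (1) the deduplication scope is often confined to one particular storage device, so deduplication across the full data range is not possible, which affects the overall deduplication ratio and effect; (2) data migration between heterogeneous storage devices needs a separate, independent host to first restore the data and then migrate it, which affects migration performance; (3) different storage devices with deduplication use different metadata management mechanisms and policies, which makes unified management of consolidated heterogeneous storage resources difficult.

Summary of the Invention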
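To overcome the limitations of traditional methods in implementing deduplication on virtualized storage devices, the present invention proposes a method for implementing deduplication at the virtualization layer (not the host layer or the physical storage device layer) of a block-level virtualized storage device, the method comprising:
deleting duplicate data in the actual physical data corresponding to a specified virtual LBA address space, and obtaining the deduplicated data segments of the physical data;
establishing a correspondence between the virtual LBA address space and the deduplicated data segments of the physical data;
according to the correspondence and the metadata information of the data segments, obtaining the storage location information of the actual physical data corresponding to the virtual LBA address space targeted by an external data read/write request, and completing I/O redirection.
Before the step of deleting duplicate data in the actual physical data corresponding to the specified virtual LBA address space, the method further comprises: setting a deduplication policy and a minimum deduplication data operation unit.
The step of deleting duplicate data in the actual physical data corresponding to the specified virtual LBA address space specifically comprises:
according to the minimum deduplication data operation unit, extracting data of a specified length for deduplication from the actual physical data corresponding to the virtual LBA address space;
according to the deduplication policy, dividing the data of the specified length, in units of the minimum deduplication data operation unit, into data segments of a specified size;
calculating the data fingerprints of the data segments of the specified size, comparing them with the data fingerprints stored in a data fingerprint library, and deleting the duplicate data in the actual physical data where the comparison shows identical fingerprints.
The step of obtaining the deduplicated data segments of the physical data further comprises: updating the metadata of the deduplicated data segments of the physical data.
The minimum deduplication data operation unit is an integer multiple of a block, an integer multiple of a bit, or an integer multiple of a byte.
The block-level virtualized storage device has an in-band or out-of-band architecture.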
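The present invention provides a system for implementing deduplication on a block-level virtualized storage device, the system comprising:
a virtual LUN device, provided to front-end hosts for mounting and use;
a deduplication module, configured to delete duplicate data in the actual physical data corresponding to a specified virtual LBA address space and obtain the deduplicated data segments;
a global metadata management module, configured to establish the correspondence between the virtual LBA address space and the deduplicated data segments, to manage and update the metadata in the global metadata pool device, and, according to a received virtual LBA address space, the correspondence, and the metadata information of the deduplicated data segments, to obtain the storage location information of the actual physical data corresponding to the virtual LBA address space and send that storage location information;
a global metadata pool device, configured to store the correspondence information established by the global metadata management module and the metadata information of the deduplicated data segments obtained by the deduplication module;
a storage virtualization module, configured to send the virtual LBA address space of an external data read/write I/O request to the global metadata management module, to receive from the global metadata management module the storage location information of the actual physical data corresponding to the virtual LBA address space, and to complete I/O redirection;
a physical LUN device, configured to store the actual physical data.
The deduplication module comprises:
a setting unit, configured to set the deduplication policy and the minimum deduplication data operation unit;
an obtaining unit, configured to obtain the actual physical data storage location information corresponding to the specified virtual LBA address space;
an extracting unit, configured to extract from the physical LUN device, according to the actual physical data storage location information acquired from the obtaining unit and the minimum deduplication data operation unit set by the setting unit, data of a specified length for deduplication;
a dividing unit, configured to divide the data of the specified length extracted by the extracting unit, according to the deduplication policy and the minimum deduplication data operation unit set by the setting unit, into data segments of a specified size;
a data fingerprint library unit, configured to store data fingerprints;
a deduplication unit, configured to calculate the data fingerprints of the data segments of the specified size produced by the dividing unit, compare them with the data fingerprints stored in the data fingerprint library unit, and send the comparison result;
a metadata management and update unit, configured to receive the comparison result and, when the comparison shows identical data fingerprints, send the content and request of the metadata update to the global metadata management module.
The minimum deduplication data operation unit is an integer multiple of a block, an integer multiple of a bit, or an integer multiple of a byte.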
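The present invention further provides a system for implementing deduplication on a block-level virtualized storage device, the system comprising:
a virtual LUN device, provided to front-end hosts for mounting and use;
a storage virtualization metadata pool device, configured to store the metadata information corresponding to the virtual LBA address space;
a deduplication metadata pool device, configured to store the metadata information of the data segments deduplicated by the deduplication module;
a deduplication module, configured to delete duplicate data in the actual physical data corresponding to a specified virtual LBA address space, obtain the deduplicated data segments, and update the metadata information in the deduplication metadata pool device;
a global metadata management module, configured to establish the correspondence between the virtual LBA address space and the deduplicated data segments, and to synchronize and coordinate the metadata updates and interactions of the storage virtualization module and the deduplication module;
a storage virtualization module, configured to obtain, according to the correspondence established by the global metadata management module and the metadata information of the data segments deduplicated by the deduplication module, the storage location information of the actual physical data corresponding to the virtual LBA address space targeted by an external data read/write request, to complete I/O redirection, and to update the metadata information in the storage virtualization metadata pool device;
a physical LUN device, configured to store the actual physical data.
The deduplication module comprises:
a setting unit, configured to set the deduplication policy and the minimum deduplication data operation unit;
an obtaining unit, configured to obtain from the physical LUN device the actual physical data storage location information corresponding to the specified virtual LBA address space;
an extracting unit, configured to extract from the physical LUN device, according to the actual physical data storage location information acquired from the obtaining unit and the minimum deduplication data operation unit set by the setting unit, data of a specified length for deduplication;
a dividing unit, configured to divide the data of the specified length extracted by the extracting unit, according to the deduplication policy and the minimum deduplication data operation unit set by the setting unit, into data segments of a specified size;
a data fingerprint library unit, configured to store data fingerprints;
a deduplication unit, configured to calculate the data fingerprints of the data segments of the specified size produced by the dividing unit, compare them with the data fingerprints stored in the data fingerprint library unit, and send the comparison result;
a metadata management and update unit, configured to receive the comparison result and, when the comparison shows identical data fingerprints, update the metadata of the deduplicated data segments under the coordination of the global metadata management module and send it to the deduplication metadata pool device.
The minimum deduplication data operation unit is an integer multiple of a block, an integer multiple of a bit, or an integer multiple of a byte.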
与现有技术相比, 本发明的上述技术方案的有益效果如下:
1、 本发明提供的技术方案可以跨主机和存储设备删除重复数据, 实现更大 范围的重复数据删除;
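2. The technical solution provided by the present invention does not consume host system resources, so the business applications running on the host keep running smoothly;
3. The technical solution provided by the present invention manages and protects the metadata of the deduplication function centrally, which simplifies the design and implementation of the whole system.
Brief Description of the Drawings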
图 1为本发明实施例 1提供的块级虚拟化存储设备上实现重复数据删除的 系统结构示意图;
图 2为本发明实施例 1块级虚拟化存储设备上实现重复数据删除的方法流 程图;
图 3为本发明实施例 1重复数据删除模块的结构示意图;
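Figure 4 is a schematic diagram of the system structure of Embodiment 1 of the present invention without the deduplication module deployed;
Figure 5 is a schematic diagram of the system structure of Embodiment 1 of the present invention after the deduplication module is deployed but before any duplicate data is deleted;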
图 6为本发明实施例 1在部署重复数据删除模块后, 部分数据已经去重的 系统结构示意图;
图 7为本发明实施例 1在重复数据删除后, 在线数据读、 写操作的系统结 构示意图;
图 8为本发明实施例 1将全局元数据池设备与虚拟 LUN设备合并统一管理 元数据的系统结构示意图;
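Figure 9 is a schematic diagram of the correspondence between the virtual LBA address space and the deduplicated data segments in Embodiment 1 of the present invention;
Figure 10 is a schematic diagram of the structure of the split metadata management system provided by Embodiment 2 of the present invention;
Figure 11 is a schematic diagram of the structure of the unified metadata management system provided by Embodiment 1 of the present invention.
Detailed Description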
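For a deeper understanding of the present invention, it is described in detail below with reference to the drawings and specific embodiments.
At present, deduplication in the storage virtualization layer is deployed and implemented mainly on file-system-level virtualized storage devices, for example in the technical solutions described in WO2010/033961, PCT/US2009/057772, US 2009/0204649 and US2009/0204650. No implementation of deduplication in the virtualization layer of a block-level virtualized storage device has been documented or productized. Moreover, implementing deduplication in that layer is not easy, for the following reasons:
1. Access to one piece of actual data involves several logically independent translation and pointer paths. That is, one piece of actual data corresponds to several sets of metadata serving different data management and operation functions (for example, storage virtualization and deduplication). If the management and updating of these metadata sets are not synchronized and coordinated, data access may become inconsistent and data may even be lost.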
不同于传统的在主机层中部署重复数据的删除功能, 要在虚拟化存储设备 的虚拟化层实现重复数据删除功能, 不可避免地会出现对一份物理数据的访问 存在多条逻辑上独立的转换和重定向路径。 其一是: 虚拟 LUN 上虚拟 LBA ( Logical Block Address, 逻辑块地址)地址在主机层所展现的 "虚拟" 数据, 到物理存储设备上的实际数据的转换和指向路径; 其二是, 重复数据删除后, 去重后的数据段(即重复数据删除功能对应的 "虚拟" 数据)到其对应的数据 段引用的实际物理存放位置的转换和指向路径。 以上这些数据访问路径的转换 和指向信息, 在本发明中, 被称作虚拟 LBA地址和数据段元数据。
可以想象, 如果这些 "虚拟" 数据按照各自机制操作同一份实际数据并且 没有同步更新对应的元数据信息, 可能导致数据访问混乱。 举例而言, 在存储 设备层中某一份实际物理数据被映射到某虚拟 LUN提供的部分虚拟 LBA地址 段中 (即该物理数据包含在该虚拟 LBA地址段所映射的实际数据中), 那么当 该物理数据被删除重复数据后, 其在原存储位置(实际 LBA地址空间) 的数据 可能已经不完整(部分或者全部数据可能已经被合并到了对应的数据段引用 中), 那么这时,如果到达该虚拟 LUN上虚拟 LBA地址的 I/O请求被重定向到 该物理数据原实际 LBA地址空间, 会得到不完整或者无效数据。
2、 最小数据管理和操作单元不一致。
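The smallest data unit managed by a block-level virtualized storage device is usually the smallest data unit managed by the storage medium, called a block. On a disk, for example, a block is typically 512 bytes; tape and other storage media are similar. Traditional deduplication, by contrast, usually takes the byte as its minimum operation unit when splitting and comparing the data to be deduplicated (in theory, the bit could also serve as the minimum unit).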
由于数据操作最小单元不一致, 使得重复数据删除技术不能在块级虚拟化 存储设备的虚拟化层直接应用。 具体而言, 在块级虚拟化存储设备上读写数据 是块为单位的, 以磁盘为例, 长度是 512个字节; 传统的数据去重技术中, 其待 去重数据通常是以一个字节为最小单位。 如果将重复数据删除技术直接应用于 块级虚拟化存储设备, 那么可能导致原本数据去重前存储在一个块中的数据在 数据去重后, 可能分别放到至少两个块中存储(如一个块中前半部分数据被放 置在一个数据段引用中, 后半部分数据被放置在另一个数据段引用中)。 这种拆 分虽然可以满足重复数据删除功能的设计目的一最好的数据去重效果, 但是会 导致存储虚拟化层从 "虚拟" 数据到实际数据指向路径的错乱, 主机层的数据 直接应用。
鉴于以上, 本发明提供了一种在块级虚拟化存储设备的虚拟化层上实现重 复数据删除的方法, 该方法通过获得虚拟 LBA地址空间到其对应的实际物理数 据去重后所得的数据段的对应关系, 进而根据该对应关系信息及所对应数据段 的元数据信息, 获取该虚拟 LBA地址空间对应的实际数据保存位置信息, 完成 I/O重定向。在本发明的具体实现中,需要设定重复数据删除最小数据操作单元。
需要说明的是, 在实际应用中, 块级虚拟化存储设备由于引入其它功能, 可能会在一定程度上影响数据的虚拟 LBA地址到其对应实际物理数据存放位置 的指向关系; 换言之, 二者可能不是典型存储虚拟化设备中直接指向关系, 而 是需要经过数次转换的间接指向关系, 比如有些块级虚拟化存储设备提供的虚 拟层 RAID, 或者多级虚拟化(为了提高虚拟地址空间容量)等多个虚拟 LUN 之间互相映射的系统设计。 然而无论哪种系统设计, 总可以获得指定虚拟 LUN 上指定的数据虚拟 LBA地址到其对应实际物理数据存放位置的指向信息。 另一 方面, 本发明所述方法和技术方案主要依赖于块级虚拟化存储设备所提供数据 虚拟 LBA地址到数据实际存放位置的指向信息, 与该指向信息在虚拟化存储设 备上如何获得并无直接关联, 所以以上不同的虚拟化存储设备的设计并不会影 响到本发明中所述技术方案的应用, 不影响本发明保护的范畴。 鉴于此, 以下 发明实施例的描述仅以典型存储虚拟化系统设计为例, 即数据的虚拟 LBA地址 到其对应实际物理数据存放位置的指向是直接指向关系。
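In addition, when implementing the method of the present invention, the minimum data operation unit for deduplication can be set, according to system design needs, to an integer multiple of a block, of a byte, or of a bit. Setting it to a multiple of bytes or bits avoids wasting much space, but it greatly increases the volume of metadata and makes metadata management harder. Whatever level the minimum data operation unit is unified to, the choice only concerns how the deduplication function itself is implemented (that is, how data of a specified length is divided and how its metadata is managed); it does not affect the scope of the invention, which is implementing deduplication in the virtualization layer of a block-level virtualized storage device. Therefore, to simplify the description, the following embodiments assume that the minimum data operation unit is set to the block level (that is, exactly one block).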
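Finally, the core of the proposed method is obtaining the correspondence between a virtual LBA address space and the data segments that result from deduplicating the actual physical data behind it, together with the metadata of those data segments. In traditional implementations of storage virtualization and deduplication, this information is usually kept in two separate sets of metadata, each managed and updated by its own functional module with no synchronization mechanism: information about virtual LBA addresses is stored in the storage virtualization metadata and managed by the storage virtualization module, while information about data segments is stored in the deduplication metadata and managed by the deduplication module. To avoid the metadata management conflicts described above, at least two kinds of systems can achieve the design goal of the invention. The first, described in Embodiment 1, manages and updates global metadata in a unified way, serving storage virtualization, deduplication, and other functions. The second, described in Embodiment 2, lets each functional module manage and update the metadata serving its own function, subject to system-wide coordination and synchronization. The implementation details of both systems are described below.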
实施例 1 : 统一元数据管理系统
参见图 1 ,本发明实施例提供了一种块级虚拟化存储设备上实现重复数据删 除的统一元数据管理系统, 该系统包括:
虚拟 LUN设备, 用于存储虚拟化模块提供给前端主机挂载和使用的虚拟存 储设备;
重复数据删除模块, 用于删除指定虚拟 LBA地址空间所对应的实际物理数 据中的重复数据, 获得去重后的数据段;
存储虚拟化模块, 用于将外部数据读写 I/O请求的虚拟 LBA地址空间发送 给全局元数据管理模块, 以及接收全局元数据管理模块发送的虚拟 LBA地址空 间对应的实际物理数据的存放位置信息, 完成 I/O重定向;
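a global metadata pool device, configured to store the correspondence information established by the global metadata management module and the metadata of the deduplicated data segments obtained by the deduplication module; it is a device that corresponds to a virtual LUN. If a post-processing deduplication policy is used (as in this embodiment), then for virtual LBA address spaces whose data has not yet been deduplicated, the global metadata pool device also stores the correspondence between that virtual LBA address space and the storage location of the actual physical data. In a concrete implementation, the global metadata pool device may be stored and maintained as a file, a table in a database, or a similar form;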
全局元数据管理模块, 用于建立虚拟 LBA地址空间与去重后的数据段的对 应关系, 创建和初始化全局元数据池设备, 管理和更新全局元数据池设备中的 元数据, 以及根据接收到的虚拟 LBA地址空间、 对应关系和去重后的数据段的 元数据信息, 获取虚拟 LBA地址空间对应的实际物理数据的存放位置信息, 并 发送存放位置信息; 如果采用后期重复数据删除策略(如本发明实施例), 由于 外部 I/O所请求的虚拟 LBA地址空间对应的实际物理数据可能尚未去重, 那么 全局元数据管理模块直接返回存放在全局元数据池设备中该虚拟 LBA地址空间 对应的实际物理数据存放位置信息;
物理 LUN设备, 用于存放实际物理数据的存储设备, 通常是物理存储设备 层中一个较大的存储介质 (如磁盘阵列等)上划分出来的存储逻辑单元, 用逻 辑单元号 (即 LUN )进行标识。
进一步, 重复数据删除模块包括, 如图 3所示:
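a setting unit, configured to set the deduplication policy and the minimum data operation unit for deduplication; the minimum data operation unit can be set to an integer multiple of a block, of a bit, or of a byte;
an acquisition unit, configured to obtain the storage location information of the actual physical data corresponding to a specified virtual LBA address space;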
提取单元, 用于根据从获取单元获取的实际物理数据存放位置信息, 按照 设置单元设置的重复数据删除最小数据操作单元, 从物理 LUN设备中提取用于 重复数据删除的指定长度数据;
分割单元, 用于根据设置单元设置的重复数据删除策略, 将提取单元提取 出的指定长度数据, 按照设置单元设置的重复数据删除最小数据操作单元, 分 割成指定大小的数据段;
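a data fingerprint library unit, configured to store data fingerprints; during deduplication, newly generated fingerprints are compared with the fingerprints in the library to detect duplicates;
a deduplication unit, configured to compute the data fingerprints of the data segments of the specified size produced by the splitting unit, compare them with the data fingerprints stored in the data fingerprint library unit, and send the comparison result;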
元数据管理及更新单元, 用于接收比较结果, 并在比较结果为数据指纹相 同时,将元数据更新的内容和请求发给全局元数据管理模块, 由全局元数据管理 模块结合数据去重过程中数据读写的情况及信息, 更新每个去重后数据段的元 数据。
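In practice, the global metadata management module also: 1) coordinates conflicts between data read/write processes and the deduplication process during data reads and writes (for example, when the actual data pointed to by a virtual LBA address is requested by both at the same time); and 2) interacts with the deduplication module to update the metadata of deduplicated data segments in the global metadata pool device, ensuring that the metadata of every virtual LBA address stays valid and consistent.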
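In this system, the global metadata pool device and the global metadata management module store and manage the metadata of all functions of the whole system in a unified way. Depending on where the global metadata pool device sits in the system, several topologies are possible; typical ones are shown in Figure 11 and Figure 8. In Figure 11, a metadata storage device independent of the other modules and devices (the global metadata pool device) is dedicated to storing and maintaining metadata for all functions of the system. In Figure 8, the global metadata pool device is merged with the virtual LUN device. Whichever topology is used, the implementation is similar. The details of the system are described below using the topology of Figure 11, in which the global metadata pool device is managed and maintained by the global metadata management module and holds all metadata of the system. To simplify the description, this embodiment covers only storage virtualization and deduplication; other functions such as RAID are implemented similarly and are not described again. Other topologies likewise have modules and mechanisms similar to the global metadata management module for maintaining and managing metadata; since they are implemented similarly, they are not discussed here either.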
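In practice, block-level storage virtualization can be implemented in several ways. Typical ones are the in-band architecture, whose main commercial products include IBM SAN Volume Controller (SVC), the IBM DS8000 series, the Hitachi VSP series, EMC VPLEX, and DataCore SANsymphony-V, and the out-of-band architecture, whose main commercial products include EMC Invista. Whatever the approach, the core idea is to create virtual LUNs for front-end hosts to mount and use, and to map and translate the virtual LBA address space on a virtual LUN to the physical location where the real data is stored, thereby redirecting the read/write I/O that reaches the virtual LUN. Because the method of the invention relies mainly on the virtual LUN of the virtualization layer and its metadata, differences between these approaches (such as whether the data path and the control path are separated) do not affect the applicability of the invention. To simplify the description of feasibility, this embodiment uses in-band block-level storage virtualization as the example.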
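On the other hand, deduplication itself can be implemented in several ways, typically fixed-length (fixed-length dedup), variable-length (variable-length dedup), and hybrid-length (hybrid-length dedup). Whatever the approach, the core idea is to divide data of a specified length into data segments of the required size using a predetermined algorithm, compute the fingerprints of those segments, remove duplicates by comparison, and keep a single data segment reference. Through the metadata of each data segment, all read/write I/O reaching a given data segment is redirected. Different implementations only affect deduplication performance and effectiveness, not the feasibility of the invention, so they do not affect the applicability of the invention to these deduplication solutions. To simplify the description of feasibility, this embodiment uses variable-length deduplication as the example; fixed-length deduplication can be regarded as a special case of it.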
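In addition, by the timing of deduplication, solutions divide into in-line deduplication (in-line dedup) and post-processing deduplication (post-processing dedup). Again, the two only affect overall system performance and deduplication effectiveness, not the feasibility of the invention, so they do not affect the applicability of the invention to these solutions. To simplify the description of feasibility, this embodiment uses the post-processing solution as the example.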
同时, 由于本发明的技术创新点在于将重复数据删除解决方案应用于块级 虚拟化存储设备的虚拟化层上, 而不是讨论如何进行重复数据删除; 并且, 重 复数据删除技术已经成熟, 且已大规模商业应用。 所以, 本发明实施例中有关 重复数据删除技术的实现细节如数据分割算法、 数据指纹的计算和比较等细节 将被略去, 不做深入阐释。 重复数据删除功能为讨论基础。
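To make the implementation steps easier to follow, some technical terms used in this embodiment are explained below:
1. Block: the smallest data unit managed by a storage medium. A block is a sequence of consecutive bytes or bits, usually of fixed length; on a disk it is typically 512 bytes, and tape and other storage media are similar.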
2. 数据段 (data extent)—用于描述重复数据删除功能的概念, 是指重复数据 删除功能模块在删除重复数据前, 按照预定算法 (不同的重复数据删除方案的 数据段划分方法也不同)将指定长度的数据划分成多个符合要求大小的数据段; 通过计算这些数据段的指纹, 比较它们的异同, 实现删除重复数据。 重复数据 删除后, 数据段则表示一个逻辑概念, 通过其对应的数据段元数据信息, 指向 保存在其对应的数据段引用中实际物理数据。
3. 数据段引用( data extent reference )一用于描述重复数据删除功能的概念, 是指在重复数据删除后, 对于内容重复的数据段, 仅保存一份它们的物理数据 在指定存储介质上, 且建立这些数据段到该份唯一物理数据拷贝的引用关系, 这里被多个数据段所引用的唯一物理数据拷贝, 称作这些数据段对应的数据段 引用。
4. 数据段元数据 (data extent metadata)—用于描述重复数据删除功能的概 念, 是指数据去重后, 所保存的数据段与其对应的数据段引用存放地址的引用 信息(也称指向信息或指针信息); 该信息中还包含该数据段引用所保存的实际 位置信息 (如 LUN所在物理设备位置和 LUN上对应的 LBA地址等信息)。 数 据去重后, 所有到达数据段的 I/O都会根据该数据段对应的元数据重定向到其对 应的数据段引用。
5. 虚拟 LBA地址的元数据 (virtual LBA address metadata)—服务于存储虚 拟化数据访问 I/O重定向功能, 是指用于从指定虚拟 LBA地址重定向到实际数 据存储位置的信息。 该元数据信息可以根据系统的设计需要, 包含不同的信息, 如在虚拟层如果实现软件 RAID或者多级虚拟化, 那么该元数据将包含在加入 这些功能之后, 指定虚拟 LBA地址重定向到实际数据保存位置所必需的信息。 以本实施例而言, 该元数据将包含以下信息: 指定的虚拟 LBA地址所对应的实 际数据是否已经去重, 如果已经去重, 其所对应的数据段及相对数据段头部的 偏移量; 如果没有去重, 该虚拟 LBA地址所对应的实际数据存放位置的指向信 息。
6. 虚拟 LUN元数据 (virtual LUN metadata) 一主要指虚拟 LUN所包含的虚 拟 LBA地址元数据的集合。 现实中, 该元数据可以以一个文件或数据库中的一 张表等形式进行保存和维护。
7. Storage virtualization metadata: mainly includes the metadata of at least one virtual LUN, plus information supporting other functions of the virtual LUN (such as RAID).
8. 重复数据删除元数据 (data dedup metadata) 一主要包括数据段的元数据 及必要的支持元数据维护功能信息 (如元数据存放的空间规划与部署等)。
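Definitions 4 and 5 above can be pictured as small record types. The following Python sketch is illustrative only: the class and field names (`PhysicalLocation`, `SegmentMeta`, `VirtualLbaMeta`, `offset_blocks`, and so on) are assumptions introduced here and are not names used by the invention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhysicalLocation:
    lun_id: str   # physical LUN that holds the data (hypothetical identifier)
    lba: int      # real LBA on that LUN, counted in blocks


@dataclass(frozen=True)
class SegmentMeta:
    """Data segment metadata: where the segment's reference is stored."""
    reference_start: PhysicalLocation  # first block of the data segment reference
    length_blocks: int                 # segment length, a whole number of blocks


@dataclass(frozen=True)
class VirtualLbaMeta:
    """Virtual LBA address metadata for one virtual LBA."""
    deduped: bool
    segment_id: Optional[int] = None             # used only when deduped
    offset_blocks: Optional[int] = None          # offset from the segment head
    location: Optional[PhysicalLocation] = None  # used only when not deduped

    def __post_init__(self) -> None:
        # Enforce the two shapes described in definition 5.
        if self.deduped:
            if self.segment_id is None or self.offset_blocks is None:
                raise ValueError("deduplicated entry needs segment_id and offset_blocks")
        elif self.location is None:
            raise ValueError("non-deduplicated entry needs a physical location")
```

Keeping the "deduplicated" flag next to either a segment reference or a direct location mirrors the text: one lookup tells the redirect logic which of the two paths to follow.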
参见图 1和图 2,基于统一元数据管理系统, 本发明实施例提供了一种块级 虚拟化存储设备上实现重复数据删除的方法, 包括以下步骤:
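Step 101: In the virtualization layer of the block-level virtualized storage device, deploy the deduplication module and the global metadata management module, then create and initialize a global metadata pool device for the specified virtual LUN;
According to actual system requirements, such as performance, functionality, and the target deduplication ratio, select a deduplication scheme and deploy the corresponding deduplication module; as noted above, this embodiment uses the currently mainstream variable-length, post-processing scheme;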
在重复数据删除模块部署之后, 还要制定相应的重复数据删除策略, 包括: 设定重复数据删除引擎的启动时间 (如在数据读写不频繁的晚上)、 设定数据去 重空间回收的时间及周期等等; 重复数据删除策略的制定, 往往与重复数据删 除模块的功能设计有关, 不同的数据去重方案可能导致其对应的重复数据删除 策略不同;
在部署完重复数据删除模块后, 再部署全局元数据管理模块; 然后, 由全 局元数据管理模块对指定虚拟 LUN创建一个对应的全局元数据池, 在具体实现 中, 可以为每个虚拟 LUN创建一个独占的全局元数据池, 也可以使之与其他虚 拟 LUN共用一个全局元数据池; 由于两者实现方法相似, 因此本发明实施例仅 以为每个虚拟 LUN创建一个独占的全局元数据池为例进行阐述;
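Once the global metadata pool has been created, the global metadata management module must initialize it, as follows: 1) For a given virtual LUN, create a global metadata pool, Dedup vLUN. Through the storage virtualization module, the global metadata management module obtains the virtual LBA address space on that virtual LUN and the pointer information from the virtual LBA address space to the allocated real LBA address space, and copies them one by one to the corresponding Dedup vLUN. In other words, for every virtual LBA address on the virtual LUN, the Dedup vLUN now holds the same virtual LBA address and the same pointer to the storage location of the actual physical data. If the real LBA address space behind the virtual LUN is allocated dynamically (for example, with thin provisioning), this information is copied to the Dedup vLUN after allocation. 2) Initially, none of the actual physical data behind the virtual LBA addresses in the global metadata pool has been deduplicated, so the metadata of these virtual LBA addresses is marked with a "not deduplicated" state flag;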
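The initialization just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the pool is modeled as an in-memory dictionary, and the names `init_metadata_pool`, `on_space_allocated`, and `NOT_DEDUPED` are hypothetical.

```python
NOT_DEDUPED = "not_deduped"


def init_metadata_pool(virtual_lun_map):
    """Build a Dedup vLUN-style pool from the virtual LUN's current mapping.

    virtual_lun_map: dict of virtual LBA -> (lun_id, real_lba) for every
    virtual LBA that already has real space allocated.
    """
    return {
        vlba: {"state": NOT_DEDUPED, "location": location}
        for vlba, location in virtual_lun_map.items()
    }


def on_space_allocated(pool, vlba, location):
    """Thin-provisioning case: copy the mapping once real space is allocated."""
    pool[vlba] = {"state": NOT_DEDUPED, "location": location}
```

For example, `init_metadata_pool({0: ("LUN A", 100)})` yields a pool whose single entry points at real LBA 100 on LUN A and is flagged as not deduplicated.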
在全局元数据管理模块和全局元数据池部署后,当有数据访问 I/O到达虚拟 LUN上确定的虚拟 LBA地址时, 存储虚拟化模块需要将该虚拟 LBA地址传输 给全局元数据管理模块, 由全局元数据管理模块返回实际物理数据存放的位置 信息给存储虚拟化模块, 由存储虚拟化模块完成 I/O重定向;
对比图 4和图 5, 可以反映出步骤 101完成前后的变化: 图 4是没有部署重 复数据删除功能模块的系统结构示意图, 从图 4 中可以看出, 存储虚拟化就是 将虚拟 LUN上的虚拟 LBA地址映射到实际 LUN (如图 4中的 LUN A, LUN B, LUN C ) 的实际 LBA地址, 完成主机端发送过来的 I/O请求的重定向; 图 5是 在部署重复数据删除功能模块后尚未删除重复数据的系统示意图, Dedup vLUN 是对应于虚拟 LUN的全局元数据池;
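After the initialization in Step 101, the virtual LBA address space of the virtual LUN corresponds one-to-one (through the global metadata management module) with the virtual LBA address space of the Dedup vLUN, and the Dedup vLUN also stores the storage locations of the actual physical data for these virtual LBA address spaces;
Step 102: The setting unit sets the minimum data operation unit for deduplication and the deduplication policy; according to the policy, the duplicate data in the actual physical data corresponding to the specified virtual LBA address space is deleted, yielding the deduplicated data segments of the physical data;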
需要说明的是, 本发明实施例中的虚拟 LBA地址空间为一段虚拟 LBA地 址段, 包含若干连续或者不连续的虚拟 LBA地址;
设置单元将重复数据删除最小数据操作单元统一到块级别, 使之与存储介 质的最小数据单元一致;
根据设置单元设置的重复数据删除策略, 删除指定虚拟 LBA地址空间所对 应的实际物理数据中的重复数据, 获得物理数据去重后的数据段, 具体包括以 下子步骤: 1 )重复数据删除模块中的获取单元在与全局元数据管理模块交互后, 获取未被去重的指定虚拟 LBA地址空间及其对应的实际物理数据存放位置信 息; 2 )根据获取单元所获取的虚拟 LBA地址空间所对应的实际物理数据存放 位置信息, 重复数据删除模块中的提取单元从该实际物理数据存放位置信息指 定的物理位置按照块的边界, 提取用于重复数据删除的指定长度数据, 即所提 取数据的起始和终止位置必须是块的边界, 该所提取数据长度是块长度的整数 倍; 3 )根据设置单元设置的重复数据删除策略, 重复数据删除模块中的分割单 元将提取出的指定长度数据以块为最小单位, 分割成指定大小的数据段(每个 切割后的数据段也是由至少一个完整的块组成); 4 ) 重复数据删除模块中的重 复数据删除单元计算分割的指定大小的数据段的数据指纹, 并与数据指纹库单 元存储的数据指纹进行比较去重, 获得指定虚拟 LBA地址空间对应的物理数据 去重后的数据段;
在步骤 1 ) 中, 全局元数据管理模块需要根据所保存的虚拟 LBA地址空间 的元数据中关于指定虚拟 LBA地址是否已经去重的信息, 以及存储虚拟化模块 的 I/O请求情况, 选定一段未被数据读写进程占用的虚拟 LBA地址, 交予重复 数据删除模块进行重复数据删除;
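Sub-steps 2) to 4) of Step 102 can be sketched as follows. The sketch makes several assumptions that are not in the source: it uses fixed-length segments (the special case of variable-length deduplication noted earlier), SHA-256 as the fingerprint function, and an in-memory dictionary as the fingerprint library; all function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 512  # bytes per block, as in the disk example


def split_block_aligned(data, segment_blocks):
    """Split extracted data into segments made of whole blocks only."""
    if len(data) % BLOCK_SIZE != 0:
        raise ValueError("extracted data must start and end on block boundaries")
    step = segment_blocks * BLOCK_SIZE
    return [data[i:i + step] for i in range(0, len(data), step)]


def dedup_pass(data, segment_blocks, fingerprint_library):
    """Return one (reference_id, is_duplicate) pair per segment.

    fingerprint_library: dict of fingerprint -> reference id; updated in place.
    """
    results = []
    for segment in split_block_aligned(data, segment_blocks):
        fingerprint = hashlib.sha256(segment).hexdigest()
        if fingerprint in fingerprint_library:
            # Same fingerprint: point this segment at the existing reference.
            results.append((fingerprint_library[fingerprint], True))
        else:
            reference_id = len(fingerprint_library)
            fingerprint_library[fingerprint] = reference_id
            results.append((reference_id, False))
    return results
```

Because every segment boundary falls on a block boundary, no block is ever split between two data segment references, which is the property the block-level minimum operation unit is meant to guarantee.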
步骤 103: 更新去重后的数据段的元数据, 建立虚拟 LBA地址空间与去重 后数据段的对应关系, 以及更新虚拟 LBA地址空间所含虚拟 LBA地址的元数 据;
在步骤 102 完成后, 根据数据去重后的结果, 重复数据删除模块中的元数 据管理及更新单元将元数据更新的内容和请求发给全局元数据管理模块, 全局 元数据管理模块将综合数据去重过程中数据读写的情况及信息, 更新每个去重 后数据段的元数据;
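Further, based on the deduplication result, the global metadata management module establishes the correspondence between the virtual LBA address space used for deduplication and the data segments obtained by deduplicating its actual physical data. As shown in Figure 9, the virtual LBA address space of the data corresponds to a real LBA address space on a physical LUN; deduplicating the actual physical data stored there yields data segments DE1, DE2, and DE3, which point to data segment references DI 1, DI 2, and DI 1 respectively. Figure 9 shows that, through their pointers to the same pre-deduplication real LBA address space, each virtual LBA address in the virtual LBA address space can be matched one-to-one with a block in DE1, DE2, or DE3 (because the minimum data operation unit here is the block, matching the storage medium's minimum data management unit). The figure marks this correspondence with double-headed arrows; for example, one virtual LBA address is paired with c2 in DE2;
Once this correspondence is established, the metadata of each virtual LBA address is updated to record whether the actual physical data it points to has been deduplicated. If it has, the metadata also includes the corresponding data segment and the offset relative to the head of that segment. If it has not (the actual physical data behind the virtual LBA address may have been written during deduplication, which invalidates the deduplication of that data; see Step 104), the metadata includes the pointer to the storage location of the actual physical data;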
在元数据更新后, 还需定期回收重复数据删除后释放出来新的物理空间, 该物理空间回收的发起和执行在不同的系统设计中可能有不同的选择, 比如, 整个物理空间的管理可以由存储虚拟化模块负责, 其空间的回收也可以由它发 起, 由重复数据删除模块完成;
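One simple way to decide which data segment references can be reclaimed is to count how many data segments still point at each reference. The sketch below is an assumption about how this could be done, not the reclamation mechanism of the invention; the function name and the dictionary layout are hypothetical, and the labels follow Figure 9.

```python
from collections import Counter


def reclaimable_references(segment_to_reference, stored_references):
    """Return stored references that no data segment points to any more."""
    counts = Counter(segment_to_reference.values())
    return sorted(ref for ref in stored_references if counts[ref] == 0)
```

With the Figure 9 mapping `{"DE1": "DI1", "DE2": "DI2", "DE3": "DI1"}` nothing is reclaimable; once DE2 is removed from the mapping, DI2 is no longer referenced and its space can be released.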
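Comparing Figure 5 and Figure 6 shows the change made by Steps 102 and 103. Figure 5 shows the system after the deduplication module is deployed but before any duplicate data is deleted; Dedup vLUN is the global metadata pool corresponding to the virtual LUN. Figure 6 shows the system after part of the data has been deduplicated. The deduplicated data segments are denoted $c_i$ (i = 1, 2, ..., 8, ..., n, where n is a natural number), and each has its own length, equal to the length of the real LBA range of its data segment reference; with variable-length deduplication these lengths may differ. For convenience, this embodiment creates a physical LUN device named "Dedup LUN" on the storage medium to hold the data segment references of the deduplicated data segments. Because the minimum data operation unit has been set to the block level in this embodiment, each segment length is an integer multiple of the block length of the storage medium, and every data segment reference consists of whole blocks. At this point the Dedup vLUN stores not only a copy of the virtual LBA address space identical to that of the virtual LUN, but also the metadata of every virtual LBA address and of every deduplicated data segment;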
步骤 104: 对到达虚拟 LUN上某确定虚拟 LBA地址空间的数据读写 I/O请 求, 根据所保存的该虚拟 LBA地址空间与去重后数据段的对应关系及数据段的 元数据信息, 获取实际物理数据的存放位置信息, 完成虚拟化存储设备数据读 写 I/O的重定向;
需要说明的是, 出于一般性考虑,本步骤的设计主要以去重后数据 I/O的重 定向为讨论基础,这也是本发明尝试解决的核心问题,对于外部数据 I/O访问的 虚拟 LBA地址对应的实际物理数据尚未去重的情况, 如采取后期去重的重复数 据删除策略(如本发明实施例), 与未部署重复数据删除功能的虚拟化存储设备 类似, I/O重定向主要依据的是预存在虚拟 LBA地址元数据中该虚拟 LBA地址 与实际物理数据保存位置的对应信息, 本发明实施例中关于指定虚拟 LBA地址 对应的实际物理数据是否去重的信息保存在虚拟 LBA地址的元数据中备索; 当有外部数据访问 I/O请求到达指定虚拟 LBA地址上时, 存储虚拟化模块 将该虚拟 LBA地址发送给全局元数据管理模块, 全局元数据管理模块根据该虚 拟 LBA地址对应的元数据信息, 判断该虚拟 LBA地址对应的实际物理数据是 否已经被去重, 如果未被去重, 则返回该虚拟 LBA地址对应的实际物理数据存 放位置信息给存储虚拟化模块; 如果已经去重, 根据该虚拟 LBA地址的元数据 信息(所对应的数据段及相对数据段头部的偏移量), 及所对应数据段的元数据 信息(包含了其对应的数据段引用的实际存放位置信息), 通过以下计算(参看 图 6 ), 获取实际物理数据的存放位置信息, 返回给存储虚拟化模块:
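Suppose the physical data behind virtual LBA address $vL_a$, requested by a host read/write I/O, has already been deduplicated on the Dedup vLUN and corresponds to the position at offset $rL_a$ from the head of the deduplicated data segment $c_k$. Because the minimum data operation unit in this embodiment is the block, $rL_a$ is the relative LBA distance of the position of $vL_a$ in $c_k$ from the head of $c_k$. The actual data storage location $pL_a$ for $vL_a$ is a real LBA address inside the data segment reference of $c_k$ and is given by formula (1):
$$pL_a = pAddr_{ks} + rL_a \qquad (1)$$
where $pAddr_{ks}$ is the starting LBA address of the physical location that holds the data segment reference of $c_k$, which is known from the data segment metadata saved after deduplication; $rL_a$ is likewise known, saved in the virtual LBA address metadata during deduplication. The calculation therefore yields the actual data storage location $pL_a$ for the virtual LBA address $vL_a$;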
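The lookup described above, including formula (1), can be sketched as a single function. The dictionary layouts and the name `resolve` are assumptions made for illustration; offsets and addresses are counted in blocks, as in this embodiment.

```python
def resolve(vlba, lba_meta, segment_meta):
    """Return (lun_id, real_lba) for a virtual LBA.

    lba_meta: virtual LBA -> {"deduped": False, "location": (lun_id, lba)}
              or {"deduped": True, "segment": k, "offset": rLa}
    segment_meta: segment id k -> (lun_id, pAddr_ks), the start of the
                  data segment reference of segment k
    """
    meta = lba_meta[vlba]
    if not meta["deduped"]:
        # Not yet deduplicated: the stored pointer is returned directly.
        return meta["location"]
    lun_id, p_addr_ks = segment_meta[meta["segment"]]
    return (lun_id, p_addr_ks + meta["offset"])  # formula (1)
```

For instance, if a virtual LBA maps to offset 3 in segment 2 and the reference of segment 2 starts at real LBA 4096 on the Dedup LUN, the function returns real LBA 4099 on that LUN.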
在获取全局元数据管理模块返回的实际数据存放位置信息后, 存储虚拟化 模块便可以完成到达虚拟 LUN数据读写 I/O重定向和数据的实际读写, 具体包 括以下几种情况:
1、 重复数据删除前, 数据的读写操作;
在全局元数据管理模块创建和初始化 Dedup vLUN后, 所有虚拟 LBA地址 的元数据中已经包含了其对应的实际物理数据的存放位置信息;
因重复数据删除前, 所有到达虚拟 LUN上确定虚拟 LBA地址的数据读写 I/O请求, 全局元数据管理模块直接返回预先保存的该虚拟 LBA地址对应的实 际物理数据存放位置信息给存储虚拟化模块,进而由存储虚拟化模块完成 I/O的 重定向, 整个过程与无重复数据删除功能的虚拟化存储设备基本一致, 所以这 里不再赘述细节;
2、 重复数据删除后, 数据的读写操作;
数据去重后, 虚拟 LUN或者 Dedup vLUN上将有至少一部分的虚拟 LBA 地址对应的实际物理数据被重构到去重后的数据段中, 这种变化使得虚拟 LBA 地址的转换机制与传统存储虚拟化有所不同,但是对于主机层面的数据 I/O访问 则是完全透明的;
1 )在线数据读操作;
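After deduplication, a read differs from a read before deduplication, as shown in Figure 7. Suppose an external read I/O request is dispatched to a range of virtual LBA addresses on the virtual LUN (that is, it accesses the physical data mapped by a range of blocks ending at $b_n$). The storage virtualization module sends the read request for this range to the global metadata management module, which determines that the corresponding deduplicated data is part of data segments $c_2$ to $c_6$ (the blocks from the second block of $c_2$ to the second block of $c_6$). After the virtual LBA address translation described above, it obtains the LBA addresses where the actual data is stored (possibly non-contiguous) and returns them to the storage virtualization module, which fetches the data from those physical locations and returns it to the external read I/O request;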
2 )在线数据写操作;
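After deduplication, a write differs from a write before deduplication, as shown in Figure 7. Suppose an external write I/O request is dispatched to a range of virtual LBA addresses on the virtual LUN (that is, it accesses the physical data mapped by a range of blocks ending at $b_n$). The storage virtualization module sends the write request for this range to the global metadata management module, which finds that the corresponding deduplicated data is part of data segments $c_2$ to $c_6$ (the blocks from the second block of $c_2$ to the second block of $c_6$). Then: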
( 1 )全局元数据管理模块通过存储虚拟化模块在后端存储介质上将为该次 写 I/O分配新的存储空间, 并将新存储空间位置信息返回给存储虚拟化模块,存 储虚拟化模块进而将外部写 I/O重定向到新分配的存储位置, 将数据写入;
( 2 )全局元数据管理模块通过存储虚拟化模块在后端存储介质上分配新的 存储空间, 由重复数据删除模块将该次写 I/O未影响的数据段中块(即 c2的第 一个块及 c6的第三个块)在数据段引用中对应的实际数据拷贝到新分配的存储 位置, 保存起来;
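(3) The global metadata management module updates the metadata of the virtual LBA address ranges corresponding to data segments $c_2$ to $c_6$ in the global metadata pool: ① for the virtual LBA addresses on the Dedup vLUN affected by this write I/O, it updates the pointer to the actual data location to the storage location newly allocated in step (1); ② for the virtual LBA addresses on the Dedup vLUN that belong to the data segments involved in this write I/O but are not affected by it, namely those of the first block of $c_2$ and the third block of $c_6$, it updates the pointer to the location of the copies made in step (2); ③ it marks the whole virtual LBA address range on the Dedup vLUN corresponding to data segments $c_2$ to $c_6$ (which is larger than the range affected by this write I/O) as "not deduplicated", so that the deduplication module will later deduplicate it again according to the predetermined policy;
(4) According to a preset policy, the physical space on the Dedup LUN occupied by the data segment references that the blocks of the original $c_2$ to $c_6$ pointed to is reclaimed periodically (provided no other data segment points to that physical data);
3. Data reads and writes during deduplication;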
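This case is a matter of conflict coordination, handled by the global metadata management module. During deduplication the metadata of the virtual LBA addresses in the global metadata pool has not yet been updated, so for the duration of a read/write I/O the global metadata management module locks metadata updates for the virtual LBA addresses involved;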
如果是数据读 I/O, 那么在该 I/O完成后,可以允许所涉及虚拟 LBA地址的 元数据更新, 即将之对实际数据位置的指向信息, 更新为, 该虚拟 LBA地址对 应实际数据已经去重, 及其所对应的数据段及相对数据段头部的偏移量;
如果是数据写 I/O, 需要根据重复数据删除的进展情况决定采取相应的措 施: 如果重复数据删除进程尚未完成, 那么需要将重复数据删除进程(仅针对 该写 I/O关联虚拟 LBA地址段的重复数据删除任务)暂时挂起, 待正常数据写 操作完成后, 再重新启动(需要更新重复数据删除目标数据); 如果重复数据删 除已经完成, 需要更新对应虚拟 LBA地址(该虚拟 LBA地址长度可能大于该 次写 I/O所影响的虚拟 LBA地址长度) 的元数据, 那么需要将该次写 I/O请求 所关联的去重后数据段对应的全部虚拟 LBA地址段的元数据标记为未去重, 保 留其对实际数据存放位置的指向信息, 待以后根据重复数据删除策略, 再删除 重复数据。
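The online write path after deduplication (steps (1) to (3) of the write case above) can be sketched as a planning function that decides, for each virtual LBA of the touched data segments, whether new data is written or old data is copied out of the data segment reference. This is a minimal sketch under stated assumptions: the dictionary layouts, the `segment_vlbas` index, the `allocate` callback, and the function name are all hypothetical, and actual data movement and space reclamation are left out.

```python
def write_after_dedup(pool, segment_vlbas, write_vlbas, allocate):
    """Plan a write and update the metadata pool in place.

    pool: virtual LBA -> {"deduped": True, "segment": k, "offset": n}
          or {"deduped": False, "location": loc}
    segment_vlbas: segment id -> list of virtual LBAs mapped onto it
    allocate: callable returning a newly allocated physical location
    Returns a list of (vlba, action, location), action being "write" or "copy".
    """
    write_set = set(write_vlbas)
    plan = []
    # Addresses that were never deduplicated are written where they are.
    for vlba in sorted(write_set):
        if not pool[vlba]["deduped"]:
            plan.append((vlba, "write", pool[vlba]["location"]))
    touched = {pool[v]["segment"] for v in write_set if pool[v]["deduped"]}
    for segment in sorted(touched):
        for vlba in segment_vlbas[segment]:
            new_location = allocate()
            # Written blocks receive new data; untouched blocks of the same
            # segment are copied out of the data segment reference.
            action = "write" if vlba in write_set else "copy"
            plan.append((vlba, action, new_location))
            # The whole segment range reverts to "not deduplicated".
            pool[vlba] = {"deduped": False, "location": new_location}
    return plan
```

Marking the whole range of every touched segment as not deduplicated, rather than only the written blocks, follows item ③ of step (3): the range is handed back to the deduplication module for a later pass.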
实施例 2 : 元数据分制式管理系统
该系统与实施例 1的区别在于: 该系统中没有一个类似于实施例 1的全局 元数据池设备统一保存和管理整个系统的元数据, 取而代之的是虚拟 LBA地址 的元数据和数据去重后数据段的元数据分别由存储虚拟化模块和重复数据删除 模块各自负责管理与更新, 如图 10所示。 但是, 这两份元数据的内容与实施例 1基本相同。 同时为了保证元数据的一致性, 全局元数据管理模块在整个系统中 发挥的作用与实施例 1 不再相同, 即不再是主要负责初始化全局元数据池设备 和统一元数据管理与更新, 而是专注于存储虚拟化和重复数据删除模块的元数 据更新的同步协调与交互。
参见图 10, 本发明实施例还提供了一种块级虚拟化存储设备上实现重复数 据删除的元数据分制式管理系统, 该系统包括:
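a virtual LUN device, provided for front-end hosts to mount and use;
a storage virtualization metadata pool device, configured to store the metadata corresponding to the virtual LBA address space;
a deduplication metadata pool device, configured to store the metadata of the data segments deduplicated by the deduplication module;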
重复数据删除模块, 用于删除指定虚拟 LBA地址空间所对应的实际物理数 据中的重复数据, 获得去重后的数据段, 以及更新重复数据删除元数据池设备 中的元数据信息;
全局元数据管理模块, 用于建立虚拟 LBA地址空间与去重后的数据段的对 应关系, 以及同步协调存储虚拟化模块和重复数据删除模块的元数据的更新及 交互;
存储虚拟化模块, 用于根据全局元数据管理模块建立的对应关系和重复数 据删除模块去重后的数据段的元数据信息, 获取外部数据读写请求指向的虚拟 LBA地址空间对应的实际物理数据的存放位置信息, 完成 I/O重定向, 以及更 新存储虚拟化元数据池设备中的元数据信息;
物理 LUN设备, 用于存放实际物理数据。
进一步, 重复数据删除模块包括:
设置单元, 用于设置重复数据删除策略及重复数据删除最小数据操作单元; 重复数据删除最小数据操作单元为块的整数倍、 比特位的整数倍或字节的整数 倍;
获取单元, 用于获取指定虚拟 LBA地址空间对应的实际物理数据存放位置 信息;
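an extraction unit, configured to extract, from the physical LUN device, data of a specified length for deduplication, according to the storage location information obtained by the acquisition unit and in the minimum data operation unit set by the setting unit;
a splitting unit, configured to split the data of the specified length extracted by the extraction unit into data segments of a specified size, according to the deduplication policy and in the minimum data operation unit set by the setting unit;
a data fingerprint library unit, configured to store data fingerprints; during deduplication, newly generated fingerprints are compared with the fingerprints in the library to detect duplicates;
a deduplication unit, configured to compute the data fingerprints of the data segments of the specified size produced by the splitting unit, compare them with the data fingerprints stored in the data fingerprint library unit, and send the comparison result;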
元数据管理及更新单元, 用于接收比较结果, 并在比较结果为数据指纹相 同时, 通过全局元数据管理模块的协调, 更新去重后数据段的元数据, 发送给 重复数据删除元数据池设备。
本实施例与实施例 1系统的区别还体现在如下几点:
1 )元数据的保存与更新
虚拟 LBA地址元数据和去重后数据段元数据的保存位置不再是全局元数据 池设备, 而是由存储虚拟化元数据池设备和重复数据删除元数据池设备分别存 储; 元数据的更新也不是由全局元数据管理模块完成, 而是分别由存储虚拟化 和重复数据删除模块完成; 但是元数据内容及元数据更新过程中全局元数据管 理模块的同步协调机制与实施例 1基本相同。
2 )元数据的获取
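A request for the metadata of a specified virtual LBA address is served by the storage virtualization module, which, after interacting with the global metadata management module, fetches it from the storage virtualization metadata pool device. If, based on the virtual LBA address metadata, the storage virtualization module needs the metadata of a particular data segment, it sends that request to the global metadata management module; the global metadata management module interacts with the deduplication module, which fetches the metadata from the deduplication metadata pool device, and the global metadata management module finally returns it to the storage virtualization module. The metadata content obtained in this process is similar to that of Embodiment 1.
Based on the split metadata management architecture, the method for implementing deduplication on a block-level virtualized storage device provided by this embodiment differs from Embodiment 1 as follows: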
步骤 101' : 在块级虚拟化存储设备的虚拟化层, 部署重复数据删除模块和 全局元数据管理模块;
与实施例 1不同,本步骤中全局元数据管理模块不需要创建和初始化全局元 数据池设备; 除了本步骤, 本实施例的其他实施细节与实施例 1基本一致, 这 里不再赘述。
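The technical solution provided by the embodiments of the present invention can delete duplicate data across hosts and storage devices, achieving deduplication over a wider scope; it does not consume host system resources, so the business applications running on the host keep running smoothly; and it manages and protects the metadata of the deduplication function centrally, which simplifies the design and implementation of the whole system.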
以上所述的具体实施方式, 对本发明的目的、 技术方案和有益效果进行了 进一步详细说明, 所应理解的是, 以上所述仅为本发明的具体实施方式而已, 并不用于限制本发明, 凡在本发明的精神和原则之内, 所做的任何修改、 等同 替换、 改进等, 均应包含在本发明的保护范围之内。

Claims

1、 一种块级虚拟化存储设备上实现重复数据删除的方法, 其特征在于, 所 述方法包括:
删除指定虚拟 LBA地址空间所对应的实际物理数据中的重复数据, 获得所 述物理数据去重后的数据段;
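establishing the correspondence between the virtual LBA address space and the deduplicated data segments of the physical data;
obtaining, according to the correspondence and the metadata of the data segments, the storage location information of the actual physical data corresponding to the virtual LBA address space targeted by an external data read/write request, and completing the I/O redirection.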
2、 如权利要求 1所述的块级虚拟化存储设备上实现重复数据删除的方法, 其特征在于, 在所述删除指定虚拟 LBA地址空间所对应的实际物理数据中的重 复数据的步骤之前还包括: 设置重复数据删除策略及重复数据删除最小数据操 作单元。
3、 如权利要求 2所述的块级虚拟化存储设备上实现重复数据删除的方法, 其特征在于, 所述删除指定虚拟 LBA地址空间所对应的实际物理数据中的重复 数据的步骤具体包括:
根据所述重复数据删除最小数据操作单元, 从虚拟 LBA地址空间对应的实 际物理数据中提取用于重复数据删除的指定长度数据;
根据所述重复数据删除策略, 将所述指定长度数据按照所述重复数据删除 最小数据操作单元, 分割成指定大小的数据段;
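computing a data fingerprint for each data segment of the specified size, comparing it with the data fingerprints stored in the data fingerprint library, and deleting the duplicate data in the actual physical data whenever the comparison finds identical fingerprints.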
4、 如权利要求 3所述的块级虚拟化存储设备上实现重复数据删除的方法, 其特征在于, 所述获得所述物理数据去重后的数据段的步骤还包括: 更新所述 物理数据去重后的数据段的元数据。
5、 如权利要求 4所述的块级虚拟化存储设备上实现重复数据删除的方法, 其特征在于, 所述重复数据删除最小数据操作单元为块的整数倍、 比特位的整 数倍或字节的整数倍。
6、 如权利要求 1-5中任一所述的块级虚拟化存储设备上实现重复数据删除 的方法, 其特征在于, 所述块级虚拟化存储设备的结构为带内或者带外体系架 构。
7、 一种块级虚拟化存储设备上实现重复数据删除的系统, 其特征在于, 所 述系统包括:
虚拟 LUN设备, 用于提供给前端主机挂载和使用;
重复数据删除模块, 用于删除指定虚拟 LBA地址空间所对应的实际物理数 据中的重复数据, 获得去重后的数据段;
全局元数据管理模块, 用于建立所述虚拟 LBA地址空间与所述去重后的数 据段的对应关系, 管理和更新全局元数据池设备中的元数据, 以及根据接收到 的虚拟 LBA地址空间、 所述对应关系和去重后的数据段的元数据信息, 获取所 述虚拟 LBA地址空间对应的实际物理数据的存放位置信息, 并发送所述存放位 置信息;
全局元数据池设备, 用于存储所述全局元数据管理模块建立的对应关系信 息及所述重复数据删除模块获得的去重后数据段的元数据信息;
存储虚拟化模块, 用于将外部数据读写 I/O请求的虚拟 LBA地址空间发送 给所述全局元数据管理模块, 以及接收所述全局元数据管理模块发送的所述虚 拟 LBA地址空间对应的实际物理数据的存放位置信息, 完成 I/O重定向;
物理 LUN设备, 用于存放实际物理数据。
8、 如权利要求 7所述的块级虚拟化存储设备上实现重复数据删除的系统, 其特征在于, 所述重复数据删除模块包括:
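a setting unit, configured to set the deduplication policy and the minimum data operation unit for deduplication;
an acquisition unit, configured to obtain the storage location information of the actual physical data corresponding to a specified virtual LBA address space;
an extraction unit, configured to extract, from the physical LUN device, data of a specified length for deduplication, according to the storage location information obtained by the acquisition unit and in the minimum data operation unit set by the setting unit;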
分割单元, 用于根据所述设置单元设置的重复数据删除策略, 将所述提取 单元提取出的指定长度数据, 按照所述设置单元设置的重复数据删除最小数据 操作单元, 分割成指定大小的数据段;
数据指纹库单元, 用于存储数据指纹;
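a deduplication unit, configured to compute the data fingerprints of the data segments of the specified size produced by the splitting unit, compare them with the data fingerprints stored in the data fingerprint library unit, and send the comparison result;
a metadata management and update unit, configured to receive the comparison result and, when it indicates identical fingerprints, send the content of the metadata update and the update request to the global metadata management module.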
9、 如权利要求 8所述的块级虚拟化存储设备上实现重复数据删除的系统, 其特征在于, 所述重复数据删除最小数据操作单元为块的整数倍、 比特位的整 数倍或字节的整数倍。
10、 一种块级虚拟化存储设备上实现重复数据删除的系统, 其特征在于, 所述系统包括:
虚拟 LUN设备, 用于提供给前端主机挂载和使用;
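a storage virtualization metadata pool device, configured to store the metadata corresponding to the virtual LBA address space;
a deduplication metadata pool device, configured to store the metadata of the data segments deduplicated by the deduplication module;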
重复数据删除模块, 用于删除指定虚拟 LBA地址空间所对应的实际物理数 据中的重复数据, 获得去重后的数据段, 以及更新所述重复数据删除元数据池 设备中的元数据信息;
全局元数据管理模块, 用于建立所述虚拟 LBA地址空间与所述去重后的数 据段的对应关系, 以及同步协调存储虚拟化模块和重复数据删除模块的元数据 的更新及交互;
存储虚拟化模块, 用于根据所述全局元数据管理模块建立的对应关系和所 述重复数据删除模块去重后的数据段的元数据信息, 获取外部数据读写请求指 向的虚拟 LBA地址空间对应的实际物理数据的存放位置信息,完成 I/O重定向, 以及更新所述存储虚拟化元数据池设备中的元数据信息;
物理 LUN设备, 用于存放实际物理数据。
11、如权利要求 10所述的块级虚拟化存储设备上实现重复数据删除的系统, 其特征在于, 所述重复数据删除模块包括:
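a setting unit, configured to set the deduplication policy and the minimum data operation unit for deduplication;
an acquisition unit, configured to obtain the storage location information of the actual physical data corresponding to a specified virtual LBA address space;
an extraction unit, configured to extract, from the physical LUN device, data of a specified length for deduplication, according to the storage location information obtained by the acquisition unit and in the minimum data operation unit set by the setting unit;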
分割单元, 用于根据所述设置单元设置的重复数据删除策略, 将所述提取 单元提取出的指定长度数据, 按照所述设置单元设置的重复数据删除最小数据 操作单元, 分割成指定大小的数据段;
数据指纹库单元, 用于存储数据指纹;
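a deduplication unit, configured to compute the data fingerprints of the data segments of the specified size produced by the splitting unit, compare them with the data fingerprints stored in the data fingerprint library unit, and send the comparison result;
a metadata management and update unit, configured to receive the comparison result and, when it indicates identical fingerprints, update the metadata of the deduplicated data segments under the coordination of the global metadata management module and send it to the deduplication metadata pool device.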
12、如权利要求 11所述的块级虚拟化存储设备上实现重复数据删除的系统, 其特征在于, 所述重复数据删除最小数据操作单元为块的整数倍、 比特位的整 数倍或字节的整数倍。
PCT/CN2011/077890 2011-06-13 2011-08-01 块级虚拟化存储设备上实现重复数据删除的方法及系统 WO2012171244A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/380,935 US20120317084A1 (en) 2011-06-13 2011-08-01 Method and system for achieving data de-duplication on a block-level storage virtualization device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 201110156839 CN102221982B (zh) 2011-06-13 2011-06-13 块级虚拟化存储设备上实现重复数据删除的方法及系统
CN201110156839.0 2011-06-13

Publications (1)

Publication Number Publication Date
WO2012171244A1 true WO2012171244A1 (zh) 2012-12-20

Family

ID=44778543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/077890 WO2012171244A1 (zh) 2011-06-13 2011-08-01 块级虚拟化存储设备上实现重复数据删除的方法及系统

Country Status (2)

Country Link
CN (1) CN102221982B (zh)
WO (1) WO2012171244A1 (zh)

Also Published As

Publication number Publication date
CN102221982B (zh) 2013-09-11
CN102221982A (zh) 2011-10-19

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 13380935

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11867731

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11867731

Country of ref document: EP

Kind code of ref document: A1