WO2017113123A1 - Data deduplication method and storage device - Google Patents

Data deduplication method and storage device

Info

Publication number
WO2017113123A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
data
data block
storage device
fingerprint
Prior art date
Application number
PCT/CN2015/099572
Other languages
English (en)
French (fr)
Inventor
张宗全
张程伟
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to JP2018500840A priority Critical patent/JP6537214B2/ja
Priority to PCT/CN2015/099572 priority patent/WO2017113123A1/zh
Priority to KR1020177026169A priority patent/KR102082765B1/ko
Priority to CN201580002563.7A priority patent/CN107430602B/zh
Priority to EP15911754.8A priority patent/EP3264285A4/en
Priority to SG11201707075SA priority patent/SG11201707075SA/en
Publication of WO2017113123A1 publication Critical patent/WO2017113123A1/zh
Priority to US15/959,273 priority patent/US10613976B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023 Free address space management
    • G06F12/0253 Garbage collection, i.e. reclamation of unreferenced memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091 Data deduplication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10 Providing a specific technical effect
    • G06F2212/1041 Resource optimization
    • G06F2212/1044 Space efficiency improvement
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system

Definitions

  • The present invention relates to the field of information technology, and in particular, to a data deduplication method and a storage device.
  • Step 1: The storage device divides a data stream into data blocks, using either a fixed-length or a variable-length chunking algorithm.
  • Step 2: The storage device calculates a fingerprint of each data block; the fingerprint is also referred to as a feature value.
  • Step 3: The storage device compares the fingerprint of the data block with the fingerprints of the unique data blocks (also referred to as non-duplicate data blocks) that it has already stored. If the fingerprint of the data block is the same as the fingerprint of a stored unique data block, step 4 is performed; otherwise, step 5 is performed.
  • Step 4: The storage device does not store the data block again; it increments the reference count of the stored data block that has the same fingerprint, and then performs step 6.
  • Step 5: The storage device stores the data block as a unique data block at physical addresses (PA) of the data container of the storage device, sequentially in the order of the logical address (LA) of the data block, and stores the metadata of the fingerprint of the data block at physical addresses of the fingerprint container of the storage device, likewise in logical address order. It generates an address identifier for the metadata of the fingerprint, establishes a mapping between that address identifier and the metadata of the fingerprint, and then performs step 6.
  • The metadata of the fingerprint of a data block includes the fingerprint of the data block and the physical address of the data block.
  • The address identifier of the metadata of the fingerprint may be the physical address at which the metadata of the fingerprint is stored.
  • The address identifier of the metadata of the fingerprint may also be a logical identifier that uniquely identifies the metadata of the fingerprint.
  • Specifically, the storage device may allocate a globally unique identifier, and a logical address, for the metadata of the fingerprint corresponding to the unique data block.
  • The address identifiers of the metadata of the fingerprints of multiple consecutive unique data blocks increase linearly.
  • The mapping between the address identifier of the metadata of the fingerprint and the metadata of the fingerprint is established so that the metadata of the fingerprint can be loaded for fingerprint queries in subsequent deduplication operations.
  • Step 6: The storage device establishes a mapping between the logical address of the data block and the fingerprint, and a mapping between the fingerprint and the physical address at which the unique data block is stored.
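The six steps above can be sketched as a toy in-memory model. This is only an illustration under assumptions that are not in the source: a fixed 4096-byte block length, SHA-1 as the fingerprint function, Python lists and dicts standing in for the containers, and the hypothetical class name `ConventionalDedupStore`. Overwrites of an already mapped logical address are not handled.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block length; the text does not specify one


class ConventionalDedupStore:
    """Toy model of steps 1-6: one mapping entry per logical block."""

    def __init__(self):
        self.data_container = []      # list index plays the role of the physical address
        self.fingerprint_index = {}   # fingerprint -> physical address
        self.ref_count = {}           # fingerprint -> reference count
        self.la_to_fp = {}            # logical address -> fingerprint (step 6)

    def write(self, start_la, stream):
        # Step 1: fixed-length chunking.
        for i in range(0, len(stream), BLOCK_SIZE):
            block = stream[i:i + BLOCK_SIZE]
            la = start_la + i // BLOCK_SIZE
            # Step 2: the fingerprint (feature value) of the block.
            fp = hashlib.sha1(block).hexdigest()
            # Step 3: compare against fingerprints of stored unique blocks.
            if fp in self.fingerprint_index:
                # Step 4: duplicate, so only increment the reference count.
                self.ref_count[fp] += 1
            else:
                # Step 5: store the unique block and its fingerprint metadata.
                self.data_container.append(block)
                self.fingerprint_index[fp] = len(self.data_container) - 1
                self.ref_count[fp] = 1
            # Step 6: per-block mapping LA -> fingerprint (-> PA via the index).
            self.la_to_fp[la] = fp

    def read(self, la):
        return self.data_container[self.fingerprint_index[self.la_to_fp[la]]]
```

Note that `la_to_fp` grows by one entry for every logical block written, whether or not the block is a duplicate.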
  • In a storage device with a deduplication function, a stored unique data block must be accessible through its logical address, and the fingerprint corresponding to a unique data block must be deleted after that stored unique data block is deleted. Therefore, in such a storage device, the mappings among the logical address of the data block, the fingerprint, and the physical address of the unique data block corresponding to the fingerprint are indispensable.
  • Although continuous deduplication of the stored data saves physical space on the storage device, performing step 6 establishes a large number of mappings, which seriously consumes the memory of the storage device.
  • In a first aspect, the present invention provides a data deduplication method, comprising:
  • the storage device receives a first data stream;
  • the storage device divides the first data stream to obtain n data blocks, where the logical addresses of the n data blocks are consecutive, the n data blocks include a first data block whose logical address is the first address among the logical addresses of the n data blocks, and n is an integer not less than 2;
  • the storage device calculates fingerprints of the n data blocks;
  • when the storage device finds no stored fingerprint identical to any of the fingerprints of the n data blocks, the storage device stores the n data blocks contiguously in a first storage area in the order of their logical addresses, where the physical address of the first data block in the first storage area is a first physical address;
  • the storage device stores the metadata of the fingerprints of the n data blocks contiguously in a second storage area in the order of the logical addresses of the n data blocks, where the metadata of any one of the fingerprints includes that fingerprint and the physical address at which the corresponding data block is stored;
  • the storage device establishes a mapping between the address identifier of the metadata of each of the fingerprints of the n data blocks and that metadata; and
  • the storage device establishes a mapping between the logical address of the first data block and an aggregated address, where the aggregated address includes the physical address of an aggregated data block and the address identifier of the metadata of an aggregated fingerprint; the physical address of the aggregated data block includes the first physical address and the physical address length of the n data blocks stored in the first storage area; and the address identifier of the metadata of the aggregated fingerprint includes the address identifier of the metadata of the fingerprint of the first data block and the number of address identifiers of the metadata of the fingerprints of the n data blocks.
  • Because one mapping covers all n data blocks, the storage device establishes fewer mappings, which saves its memory space; the mapping relationship also lets the storage device determine whether the metadata of a fingerprint needs to be deleted.
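A minimal sketch of the aggregated address may help make the memory saving concrete. The names `AggregatedAddress`, `store_unique_run`, and `resolve` are hypothetical, Python lists stand in for the first and second storage areas, and each block is assumed to occupy exactly one address unit, none of which is specified by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggregatedAddress:
    first_pa: int        # first physical address (of the first data block)
    pa_length: int       # physical address length covered by the n blocks
    first_meta_id: int   # address identifier of the first block's fingerprint metadata
    meta_count: int      # number of metadata address identifiers (n)


def store_unique_run(blocks, fingerprints, first_la, data_area, meta_area, la_map):
    """Store n new blocks contiguously and record ONE mapping for all of them."""
    first_pa = len(data_area)
    first_meta_id = len(meta_area)
    for block, fp in zip(blocks, fingerprints):
        pa = len(data_area)
        data_area.append(block)      # first storage area, logical-address order
        meta_area.append((fp, pa))   # second storage area: (fingerprint, PA)
    agg = AggregatedAddress(first_pa, len(blocks), first_meta_id, len(fingerprints))
    la_map[first_la] = agg           # a single entry instead of n entries
    return agg


def resolve(la, la_map):
    """Find the physical address of any block covered by an aggregated mapping."""
    for first_la, agg in la_map.items():
        offset = la - first_la
        if 0 <= offset < agg.pa_length:
            return agg.first_pa + offset
    raise KeyError(la)
```

In this sketch the list index of `meta_area` doubles as the address identifier of the metadata, so the identifier-to-metadata mapping is the list itself. The linear scan in `resolve` is for brevity only; a real mapping table would use an ordered structure.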
  • the first storage area and the second storage area are containers. Further, the first storage area and the second storage area may be the same storage area.
  • establishing, by the storage device, the mapping between the logical address of the first data block and the aggregated address specifically includes:
  • the storage device establishes a mapping from the logical address of the first data block to both the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • establishing, by the storage device, the mapping between the logical address of the first data block and the aggregated address specifically includes:
  • the storage device establishes a mapping between the logical address of the first data block and the address identifier of the metadata of the aggregated fingerprint, and a mapping between the address identifier of the metadata of the aggregated fingerprint and the physical address of the aggregated data block.
  • establishing, by the storage device, the mapping between the logical address of the first data block and the aggregated address specifically includes:
  • the storage device establishes a mapping between the logical address of the first data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
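The three mapping layouts just listed differ only in how the lookup is chained. The sketch below shows them side by side with plain dictionaries; the integer values and the tuple encoding (first address, count) are illustrative assumptions, not part of the source.

```python
# Illustrative values: PA range = (first physical address, length),
# metadata range = (first address identifier, count).
LA1 = 1
PA_RANGE = (1, 16)
META_RANGE = (201, 16)

# Layout 1: logical address -> (PA range, metadata range) in one entry.
direct = {LA1: (PA_RANGE, META_RANGE)}

# Layout 2: logical address -> metadata range -> PA range.
la_to_meta = {LA1: META_RANGE}
meta_to_pa = {META_RANGE: PA_RANGE}

# Layout 3: logical address -> PA range -> metadata range.
la_to_pa = {LA1: PA_RANGE}
pa_to_meta = {PA_RANGE: META_RANGE}


def lookup_layout2(la):
    """Follow LA -> metadata identifier -> physical address."""
    meta = la_to_meta[la]
    return meta_to_pa[meta], meta


def lookup_layout3(la):
    """Follow LA -> physical address -> metadata identifier."""
    pa = la_to_pa[la]
    return pa, pa_to_meta[pa]
```

All three layouts resolve a logical address to the same pair; the chained forms trade one extra lookup for the ability to share the second-level entry.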
  • the method further includes:
  • before establishing the mapping between the logical address of the first data block and the aggregated address, the storage device determines that the physical address length of the n data blocks stored in the first storage area does not exceed the compression window of the storage device.
  • the method further includes: the storage device compresses, according to the compression window, the n data blocks stored in the first storage area.
  • the method further includes:
  • the storage device receives a second data stream;
  • the storage device divides the second data stream to obtain n data blocks, where the logical addresses of the n data blocks of the second data stream are consecutive, and the n data blocks of the second data stream include a second data block whose logical address is the first address among the logical addresses of the n data blocks of the second data stream;
  • the storage device calculates fingerprints of the n data blocks of the second data stream; and
  • when the fingerprints of the data blocks at the same data block sequence position in the first data stream and the second data stream are the same, the storage device establishes a mapping between the logical address of the second data block and the aggregated address, where the data block sequence position refers to the relative position of each data block among the n data blocks of either the first data stream or the second data stream.
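The handling of the second data stream can be sketched as a position-by-position fingerprint comparison. The function name `dedup_second_stream` and the dictionary `la_map` are hypothetical, and the sketch simply reports a mismatch instead of deciding how a partially matching stream should be stored, which this passage does not specify.

```python
def dedup_second_stream(second_fps, second_first_la, first_fps, first_agg, la_map):
    """Map the second stream onto the first stream's aggregated address.

    Returns True only if every fingerprint matches at the same sequence position.
    """
    same = len(second_fps) == len(first_fps) and all(
        a == b for a, b in zip(second_fps, first_fps)
    )
    if same:
        # No block data and no fingerprint metadata are written again;
        # the second stream reuses the existing aggregated address.
        la_map[second_first_la] = first_agg
    return same
```

After a successful call, two logical addresses point at one aggregated address, so the whole run of n duplicate blocks costs a single additional mapping entry.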
  • the method further includes:
  • the storage device establishes an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint.
  • the remainder obtained by dividing the first fingerprint in the metadata of the first fingerprint by a specific integer equals a specific value.
  • the first fingerprint in the metadata of the first fingerprint is extracted randomly, or at a certain interval, from the metadata of the fingerprints stored in the second storage area.
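Two of the sampling rules described above, selection by remainder and selection at a fixed interval, can be sketched as follows. The function names, the default divisor of 4, and the representation of a fingerprint as a byte string interpreted as a big-endian integer are assumptions made for illustration.

```python
def sample_by_remainder(metadata, divisor=4, remainder=0):
    """Build a sparse index: fingerprint -> address identifier of its metadata.

    `metadata` maps address identifier -> (fingerprint bytes, physical address).
    Only fingerprints whose integer value leaves the chosen remainder are indexed.
    """
    index = {}
    for meta_id, (fp, _pa) in metadata.items():
        if int.from_bytes(fp, "big") % divisor == remainder:
            index[fp] = meta_id
    return index


def sample_by_interval(metadata, interval=4):
    """Alternative: index every `interval`-th metadata entry in stored order."""
    ids = sorted(metadata)
    return {metadata[i][0]: i for i in ids[::interval]}
```

Either rule keeps the in-memory index small; a hit on a sampled fingerprint yields the address identifier from which the neighbouring metadata can then be loaded.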
  • the logical address of the first data block may instead be the tail address among the logical addresses of the n data blocks of the first data stream, and the logical address of the second data block the tail address among the logical addresses of the n data blocks of the second data stream.
  • the mapping between the logical address of the first data block and the aggregated address, and the mapping between the logical address of the second data block and the aggregated address, each include a mapping address direction identifier.
  • In a second aspect, an embodiment of the present invention provides a data deduplication method, including:
  • the storage device receives a first data stream;
  • the storage device divides the first data stream to obtain n data blocks, where the logical addresses of the n data blocks are consecutive, the n data blocks include a first data block whose logical address is the first address among the logical addresses of the n data blocks, and n is an integer not less than 2;
  • the storage device calculates fingerprints of the n data blocks;
  • when the storage device finds no stored fingerprint identical to any of the fingerprints of the n data blocks, the storage device stores the n data blocks contiguously in a first storage area in the order of their logical addresses, where the physical address of the first data block in the first storage area is a first physical address;
  • the storage device stores the metadata of the fingerprints of the n data blocks contiguously in a second storage area in the order of the logical addresses of the n data blocks, where the metadata of any one of the fingerprints includes that fingerprint and the physical address at which the corresponding data block is stored;
  • the storage device establishes a mapping between the address identifier of the metadata of each of the fingerprints of the n data blocks and that metadata;
  • the storage device receives a second data stream;
  • the storage device divides the second data stream to obtain n data blocks, where the logical addresses of the n data blocks of the second data stream are consecutive, and the n data blocks of the second data stream include a second data block whose logical address is the first address among the logical addresses of the n data blocks of the second data stream;
  • the storage device calculates fingerprints of the n data blocks of the second data stream; and
  • when the fingerprints of the data blocks at the same data block sequence position in the first data stream and the second data stream are the same, the storage device establishes a mapping between the logical address of the second data block and an aggregated address, where the data block sequence position refers to the relative position of each data block among the n data blocks of either the first data stream or the second data stream; the aggregated address includes the physical address of an aggregated data block and the address identifier of the metadata of an aggregated fingerprint; the physical address of the aggregated data block includes the first physical address and the physical address length of the n data blocks stored in the first storage area; and the address identifier of the metadata of the aggregated fingerprint includes the address identifier of the metadata of the fingerprint of the first data block and the number of address identifiers of the metadata of the fingerprints of the n data blocks.
  • the first storage area and the second storage area are containers. Further, the first storage area and the second storage area may be the same storage area.
  • establishing, by the storage device, the mapping between the logical address of the second data block and the aggregated address specifically includes:
  • the storage device establishes a mapping from the logical address of the second data block to both the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • establishing, by the storage device, the mapping between the logical address of the second data block and the aggregated address specifically includes:
  • the storage device establishes a mapping between the logical address of the second data block and the address identifier of the metadata of the aggregated fingerprint, and a mapping between the address identifier of the metadata of the aggregated fingerprint and the physical address of the aggregated data block.
  • establishing, by the storage device, the mapping between the logical address of the second data block and the aggregated address specifically includes:
  • the storage device establishes a mapping between the logical address of the second data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • the method further includes:
  • the storage device determines that the physical address length of the n data blocks stored in the first storage area does not exceed the compression window of the storage device.
  • the method further includes: the storage device compresses, according to the compression window, the n data blocks stored in the first storage area.
  • the method further includes: the storage device establishes an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint.
  • the remainder obtained by dividing the first fingerprint in the metadata of the first fingerprint by a specific integer equals a specific value.
  • the first fingerprint in the metadata of the first fingerprint is extracted randomly, or at a certain interval, from the metadata of the fingerprints stored in the second storage area.
  • the logical address of the first data block may instead be the tail address among the logical addresses of the n data blocks of the first data stream, and the logical address of the second data block the tail address among the logical addresses of the n data blocks of the second data stream.
  • the mapping between the logical address of the second data block and the aggregated address includes a mapping address direction identifier.
  • The present invention further provides a storage device that serves as the storage device in the various possible implementations of the first aspect and the second aspect, and performs those implementations.
  • The storage device comprises structural units that implement the various possible implementations of the first aspect and the second aspect, or comprises an interface and a processor that perform the various possible implementations of the first aspect and the second aspect of the invention.
  • The present invention also provides a non-volatile computer-readable storage medium and a computer program product. When the computer instructions contained in the storage medium and the computer program product are loaded into the memory of the storage device provided by the present invention and executed by the central processing unit (CPU) of the storage device, the storage device is caused to perform the various possible implementations of the first aspect and the second aspect of the present invention.
  • FIG. 1 is a schematic structural diagram of a storage device according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of storing non-duplicate data and the metadata of fingerprints according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of an index of a fingerprint according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of storing non-duplicate data and the metadata of fingerprints according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of an index of a fingerprint according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a storage device according to an embodiment of the present invention.
  • The storage device with the deduplication function includes a central processing unit (CPU) 101 and a memory 102.
  • The CPU 101 executes computer instructions in the memory 102 to implement the deduplication operation described in the embodiment of the present invention.
  • Alternatively, a field-programmable gate array (FPGA) or other hardware may perform all of the data deduplication operations in the embodiment of the present invention, or the FPGA or other hardware and the CPU may each perform part of the deduplication operations, to implement the deduplication operation described in the embodiment of the present invention.
  • For ease of description, the embodiment of the present invention refers to whichever of these implements the data deduplication operation as the processor of the storage device. The storage device further includes an interface for receiving the data stream.
  • The interface communicates with the processor.
  • the storage device in the embodiment of the present invention further includes a persistent storage medium for storing the unique data block after the deduplication, the metadata of the fingerprint, and the like.
  • A data stream represents data from one source, such as the same file or the same application.
  • For example, a 1 MB file can be divided into several data blocks. If the file is partially modified, most of the data of the modified file is the same as before the modification and only a small amount differs, and the data blocks that the modified file shares with the file before the modification also occupy substantially the same positions in the data block sequence. In the embodiment of the present invention, this property is referred to as the duplicate locality of data blocks.
  • When the storage device determines that one data block in a data stream is a duplicate data block, the probability that the data blocks adjacent to it are also duplicate data blocks is high. Therefore, the storage device receives the data stream, divides it into data blocks, calculates the fingerprint of each data block, and queries whether the same fingerprint is already stored in the storage device. If the same fingerprint is not stored, the data block is a non-duplicate data block. The storage device stores the data blocks of the data stream that do not duplicate any stored unique data block contiguously, in the order of their logical addresses, at physical addresses of a specific area of the storage device.
  • The specific area of the storage device may be a container, which holds the non-duplicate data blocks of one data stream, stored contiguously at physical addresses in logical address order.
  • The storage device likewise stores the metadata of the fingerprints of the non-duplicate data blocks of the data stream contiguously, in the logical address order of those data blocks, at physical addresses of a specific area of the storage device. Storing the metadata of the fingerprints in this way exploits the duplicate locality of data blocks: the metadata of the fingerprints of non-duplicate data blocks with consecutive logical addresses can be loaded into memory together, which improves the hit rate of fingerprint queries during deduplication.
  • the metadata storage area of the fingerprint may be part of the aforementioned container storing the non-repeating data blocks in the data stream, or may be a separate container.
  • In the embodiment of the present invention, data blocks have consecutive logical addresses when the end position of the logical address of one data block is the start position of the logical address of another data block.
  • Physical addresses are consecutive when the end position of the physical address storing one data block is the start position of the physical address storing another data block.
  • The data blocks of the data stream that do not duplicate any unique data block already stored in the storage device are stored contiguously at physical addresses of the specific area of the storage device in the order of their logical addresses, so the physical addresses storing these data blocks are consecutive.
  • In the embodiment of the present invention, "storing the data blocks successively in a storage area in logical address order", "storing the data blocks sequentially in a storage area in logical address order", and "storing the data blocks contiguously at physical addresses of a storage area in logical address order" have the same meaning: data blocks with consecutive logical addresses also occupy consecutive physical addresses in the storage area.
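The definition of consecutive addresses above amounts to a simple check on address extents. The helper below is a hypothetical illustration that represents each data block's address range as a (start, length) pair; the same check applies to logical and to physical addresses.

```python
def is_contiguous(extents):
    """Return True if every extent ends exactly where the next one starts.

    `extents` is a list of (start, length) pairs in storage order.
    """
    return all(
        start + length == next_start
        for (start, length), (next_start, _next_length) in zip(extents, extents[1:])
    )
```

For example, three 4-unit blocks starting at 0, 4, and 8 are contiguous, while blocks starting at 0 and 5 are not.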
  • Taking a fixed-length chunking algorithm as an example, the storage device receives data stream 1 and data stream 2 and cuts each of them into fixed-length data blocks.
  • The embodiment of the present invention takes as an example the case in which the data in data stream 1 and data stream 2 is written for the first time, that is, all of the fixed-length data blocks cut from data stream 1 and data stream 2 are unique data blocks in the storage device.
  • Data stream 1 contains data blocks with consecutive logical addresses LA1 to LA16, whose fingerprints are FP1 to FP16, respectively.
  • Data stream 2 contains data blocks with consecutive logical addresses LA30 to LA45, whose fingerprints are FP30 to FP45, respectively.
  • The storage device stores the data blocks of one data stream contiguously in the same container in logical address order.
  • The storage device stores the data blocks of data stream 1 contiguously, in the order of the logical addresses LA1 to LA16, at physical addresses of container 1.
  • Taking PA1 as the starting physical address of container 1 as an example, the storage device stores the data blocks of data stream 1, in the order of their logical addresses LA1 to LA16, at physical addresses PA1 to PA16 of container 1; that is, the data blocks with logical addresses LA1 to LA16 are stored in sequence at PA1 to PA16.
  • The storage device stores the metadata of the fingerprints of the data blocks of data stream 1 (the fingerprint of a data block and the physical address storing that data block) contiguously at physical addresses of container 3, in the logical address order of the data blocks in data stream 1: FP1 and PA1 are stored at PA201, FP2 and PA2 at PA202, FP3 and PA3 at PA203, FP4 and PA4 at PA204, FP5 and PA5 at PA205, FP6 and PA6 at PA206, and so on in the same pattern through FP16 and PA16, which are stored at PA216.
  • The storage device establishes a mapping from the address identifier of the metadata of each fingerprint to that metadata: PA201 maps to FP1 and PA1, PA202 to FP2 and PA2, PA203 to FP3 and PA3, PA204 to FP4 and PA4, PA205 to FP5 and PA5, PA206 to FP6 and PA6, PA207 to FP7 and PA7, PA208 to FP8 and PA8, PA209 to FP9 and PA9, PA210 to FP10 and PA10, PA211 to FP11 and PA11, PA212 to FP12 and PA12, PA213 to FP13 and PA13, PA214 to FP14 and PA14, PA215 to FP15 and PA15, and PA216 to FP16 and PA16.
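The contiguous layout of data blocks in one container and fingerprint metadata in another can be sketched as follows. This is a simplified in-memory model under stated assumptions: the `Container` class and `store_stream` function are hypothetical names, and a physical address is modelled as a plain integer slot number.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Append-only storage area with consecutive physical addresses (illustrative)."""
    start_pa: int
    slots: list = field(default_factory=list)

    def append(self, item) -> int:
        """Store item at the next free physical address and return that address."""
        self.slots.append(item)
        return self.start_pa + len(self.slots) - 1

    def read(self, pa: int):
        return self.slots[pa - self.start_pa]


def store_stream(blocks, data_container, meta_container):
    """blocks: iterable of (fingerprint, data) in logical address order.

    Returns (first data PA, first metadata address, block count)."""
    first_pa = first_meta = None
    count = 0
    for fp, data in blocks:
        pa = data_container.append(data)
        # Metadata of a fingerprint = the fingerprint plus the PA of its block.
        meta_pa = meta_container.append((fp, pa))
        if first_pa is None:
            first_pa, first_meta = pa, meta_pa
        count += 1
    return first_pa, first_meta, count
```

Starting container 1 at address 1 and container 3 at address 201 reproduces the layout described above: the metadata entry at 204 holds FP4 together with physical address 4.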
  • Because the data blocks with logical addresses LA1-LA16 are stored contiguously at PA1-PA16 and the metadata of their fingerprints is stored contiguously at PA201-PA216, the storage device establishes a mapping between LA1 and an aggregation address.
  • The aggregation address includes the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • The address identifier of the metadata of the aggregated fingerprint includes the address identifier of the metadata of the fingerprint of the data block with logical address LA1 and the number of address identifiers of the metadata of the fingerprints of the data blocks corresponding to LA1 to LA16. The physical address of the aggregated data block includes the physical address PA1 of the data block with logical address LA1 and the physical address length occupied in container 1 by the data blocks with logical addresses LA1 to LA16.
  • The data blocks with logical addresses LA1 to LA16 are also referred to as an aggregated data block.
  • The physical address length occupied in container 1 by the data blocks with logical addresses LA1 to LA16 may be expressed as an actual physical length.
  • It may also be expressed as a number of physical blocks.
  • For example, the physical address of the aggregated data block may be expressed as PA1+16, indicating that the data blocks with logical addresses LA1 to LA16 start at physical address PA1 and span 16 physical blocks in total.
  • The address identifier of the metadata of the aggregated fingerprint is expressed as PA201+16, indicating that the address identifier of the metadata of the fingerprint of the data block with logical address LA1 is PA201 and that the data blocks corresponding to LA1 to LA16 have 16 such address identifiers in total.
  • In one implementation, establishing the mapping between LA1 and the aggregation address includes establishing a mapping from LA1 to PA1+16 and PA201+16, expressed as LA1--->PA1+16 and PA201+16, where PA1+16 and PA201+16 are stored in the same field.
  • A key-value form can be used, in which the key is LA1 and the value is PA1+16 and PA201+16.
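The key-value form can be sketched as a dictionary whose value carries both extents in one field. The type names `Extent` and `AggregationAddress` and the helper `describe` are hypothetical, introduced only to make the "start + length" notation concrete.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    start: int   # first address, e.g. PA1 or PA201
    length: int  # number of consecutive entries, e.g. 16


class AggregationAddress(NamedTuple):
    data: Extent  # physical address of the aggregated data block, "PA1+16"
    meta: Extent  # address identifier of the aggregated fingerprint metadata, "PA201+16"


# Key: first logical address of the run. Value: both extents stored in one field.
mapping_table: dict[int, AggregationAddress] = {
    1: AggregationAddress(Extent(1, 16), Extent(201, 16)),
}


def describe(la: int) -> str:
    """Render a mapping entry in the notation used in the text."""
    agg = mapping_table[la]
    return (f"LA{la}--->PA{agg.data.start}+{agg.data.length} "
            f"and PA{agg.meta.start}+{agg.meta.length}")
```

A single table entry stands in for all 16 data blocks of data stream 1.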
  • Because a single mapping covers all 16 data blocks, the storage device maintains fewer mappings, which saves memory space of the storage device, and it can determine from the mapping relationship whether the metadata of a fingerprint needs to be deleted.
  • In another implementation, the storage device establishes a mapping from LA1 to aggregation address 1, where aggregation address 1 includes PA1+8 and PA201+8, and a mapping from LA9 to aggregation address 2, where aggregation address 2 includes PA9+8 and PA209+8. This also reduces the number of mappings in the storage device and still allows the storage device to determine from the mapping relationship whether the metadata of a fingerprint needs to be deleted.
  • The physical address length of the aggregated data block can be limited according to the specific implementation, which is not limited by the present invention.
  • In another implementation, establishing the mapping between LA1 and the aggregation address includes establishing a mapping between LA1 and the address identifier of the metadata of the aggregated fingerprint, and a mapping between the address identifier of the metadata of the aggregated fingerprint and the physical address of the aggregated data block.
  • The physical address of the aggregated data block includes the physical address PA1 of the data block with logical address LA1 and the physical address length occupied in container 1 by the data blocks with logical addresses LA1 to LA16. One representation is a mapping from LA1 to PA201+16 and a mapping from PA201+16 to PA1+16, expressed as LA1--->PA201+16, PA201+16--->PA1+16; that is, for the key LA1 the value is PA201+16, and for the key PA201+16 the value is PA1+16.
  • Alternatively, the storage device establishes a mapping between LA1 and address identifier 3 of the metadata of the aggregated fingerprint, and a mapping between address identifier 3 and physical address 3 of the aggregated data block, where address identifier 3 includes PA201+8 and physical address 3 includes PA1+8. The storage device also establishes a mapping between LA9 and address identifier 4 of the metadata of the aggregated fingerprint, and a mapping between address identifier 4 and physical address 4 of the aggregated data block, where address identifier 4 includes PA209+8 and physical address 4 includes PA9+8.
  • In another implementation, establishing the mapping between LA1 and the aggregation address includes establishing a mapping between LA1 and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • These mappings are expressed as LA1--->PA1+16, PA1+16--->PA201+16.
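The two chained layouts (logical address to metadata extent to data extent, and logical address to data extent to metadata extent) can be compared in a short sketch. Extents are modelled as `(start, length)` tuples; the table and function names are hypothetical.

```python
# Layout via metadata: LA1--->PA201+16, PA201+16--->PA1+16
la_to_meta = {1: (201, 16)}
meta_to_data = {(201, 16): (1, 16)}

# Layout via data: LA1--->PA1+16, PA1+16--->PA201+16
la_to_data = {1: (1, 16)}
data_to_meta = {(1, 16): (201, 16)}


def resolve_via_meta(la):
    """Follow LA -> metadata extent -> data extent; return (data, meta)."""
    meta = la_to_meta[la]
    return meta_to_data[meta], meta


def resolve_via_data(la):
    """Follow LA -> data extent -> metadata extent; return (data, meta)."""
    data = la_to_data[la]
    return data, data_to_meta[data]
```

Both lookups arrive at the same pair of extents; they differ only in which extent is reached first, at the cost of a second table lookup compared with storing both extents in one field.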
  • The storage device stores the data blocks of data stream 2 contiguously at physical addresses in container 2 in the order of the logical addresses LA30-LA45.
  • Specifically, the data blocks of data stream 2 are stored, in the order of their logical addresses LA30-LA45, at the physical addresses PA101 to PA116 of container 2 (taking PA101 as the starting physical address of container 2 as an example).
  • The storage device stores the metadata of the fingerprints of the data blocks of data stream 2 contiguously in container 4 in the order of the logical addresses LA30-LA45: FP30 and PA101 are stored at PA301, FP31 and PA102 at PA302, FP32 and PA103 at PA303, FP33 and PA104 at PA304, FP34 and PA105 at PA305, FP35 and PA106 at PA306, FP36 and PA107 at PA307, FP37 and PA108 at PA308, FP38 and PA109 at PA309, FP39 and PA110 at PA310, FP40 and PA111 at PA311, FP41 and PA112 at PA312, FP42 and PA113 at PA313, FP43 and PA114 at PA314, FP44 and PA115 at PA315, and FP45 and PA116 at PA316.
  • The storage device establishes a mapping from the address identifier of the metadata of each fingerprint to that metadata: PA301 maps to FP30 and PA101, PA302 to FP31 and PA102, PA303 to FP32 and PA103, PA304 to FP33 and PA104, PA305 to FP34 and PA105, PA306 to FP35 and PA106, PA307 to FP36 and PA107, PA308 to FP37 and PA108, PA309 to FP38 and PA109, PA310 to FP39 and PA110, PA311 to FP40 and PA111, PA312 to FP41 and PA112, PA313 to FP42 and PA113, PA314 to FP43 and PA114, PA315 to FP44 and PA115, and PA316 to FP45 and PA116.
  • The non-duplicate data blocks with logical addresses LA30-LA45 are stored contiguously at physical addresses PA101-PA116, and the metadata of their fingerprints is also stored contiguously, at PA301-PA316. The storage device establishes a mapping between LA30 and an aggregation address, which includes the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • The address identifier of the metadata of the aggregated fingerprint includes the address identifier of the metadata of the fingerprint of the data block with logical address LA30 and the number of address identifiers of the metadata of the fingerprints of the data blocks corresponding to LA30 to LA45. The physical address of the aggregated data block includes the physical address of the data block with logical address LA30 and the physical address length occupied in container 2 by the data blocks with logical addresses LA30 to LA45. In the embodiment of the present invention, the data blocks with logical addresses LA30 to LA45 are also referred to as an aggregated data block.
  • The length from PA101 to PA116 can be expressed as an actual physical length.
  • It may also be expressed as a number of physical blocks, here 16.
  • The physical address of the aggregated data block may be expressed as PA101+16, and the address identifier of the metadata of the aggregated fingerprint as PA301+16.
  • In one implementation, establishing the mapping between LA30 and the aggregation address includes establishing a mapping from LA30 to PA101+16 and PA301+16, expressed as LA30--->PA101+16 and PA301+16.
  • With one mapping per data block, the storage device would need to establish 32 mappings for the data blocks of data stream 2, namely a mapping from LA30 to FP30, a mapping from FP30 to PA301, and so on through a mapping from LA45 to FP45 and a mapping from FP45 to PA316. In the embodiment of the present invention, only one mapping needs to be established. The storage device therefore maintains fewer mappings, which saves memory space of the storage device, and it can determine from the mapping relationship whether the metadata of a fingerprint needs to be deleted.
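The arithmetic behind the comparison is small enough to state directly. The sketch below assumes two mappings per block in the per-block scheme (logical address to fingerprint, and fingerprint to metadata address) and one mapping per aggregated run; the function names are hypothetical.

```python
def per_block_mapping_count(n_blocks: int) -> int:
    """Two mappings per block: LA -> FP and FP -> metadata address."""
    return 2 * n_blocks


def aggregated_mapping_count(n_blocks: int, max_run: int) -> int:
    """One mapping per run of at most max_run consecutive blocks."""
    return -(-n_blocks // max_run)  # ceiling division
```

For the 16 blocks of data stream 2 this gives 32 per-block mappings, one aggregated mapping when a run may span 16 blocks, and two when a run is limited to 8 blocks.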
  • In another implementation, the storage device establishes a mapping from LA30 to aggregation address 5, where aggregation address 5 includes PA101+8 and PA301+8, and a mapping from LA38 to aggregation address 6, where aggregation address 6 includes PA109+8 and PA309+8. This also reduces the number of mappings in the storage device.
  • The physical address length of the aggregated data block can be limited according to the specific implementation, which is not limited by the present invention.
  • In another implementation, establishing the mapping between LA30 and the aggregation address includes establishing a mapping between LA30 and the address identifier of the metadata of the aggregated fingerprint, and a mapping between the address identifier of the metadata of the aggregated fingerprint and the physical address of the aggregated data block.
  • The physical address of the aggregated data block includes the physical address PA101 of the data block with logical address LA30 and the physical address length occupied in container 2 by the data blocks with logical addresses LA30 to LA45. One representation is a mapping from LA30 to PA301+16 and a mapping from PA301+16 to PA101+16, expressed as LA30--->PA301+16, PA301+16--->PA101+16. For the specific representation, refer to the implementation described above.
  • Alternatively, the storage device establishes a mapping between LA30 and address identifier 7 of the metadata of the aggregated fingerprint, and a mapping between address identifier 7 and physical address 7 of the aggregated data block, where address identifier 7 includes PA301+8 and physical address 7 includes PA101+8. The storage device also establishes a mapping between LA38 and address identifier 8 of the metadata of the aggregated fingerprint, and a mapping between address identifier 8 and physical address 8 of the aggregated data block, where address identifier 8 includes PA309+8 and physical address 8 includes PA109+8. This also reduces the number of mappings in the storage device.
  • The length of the aggregation address may be defined according to the specific implementation, which is not limited by the present invention.
  • In another implementation, establishing the mapping between LA30 and the aggregation address includes establishing a mapping between LA30 and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • These mappings are expressed as LA30--->PA101+16, PA101+16--->PA301+16. For the specific implementation, refer to the implementation described above; details are not described herein again.
  • The storage device establishes an index of fingerprints to facilitate fingerprint lookup in subsequent deduplication operations and to reduce the amount of fingerprint metadata that the storage device has to cache.
  • In one implementation, among the fingerprints in the metadata of the fingerprints of the data blocks in data stream 1 and data stream 2, those that leave a specific remainder when divided by a specific integer may be used as the representative fingerprints in the index (also referred to as sample fingerprints).
  • For example, a fingerprint whose remainder after the division is 3 is used as a sample fingerprint, and a mapping from the sample fingerprint to the address identifier of the metadata of that fingerprint is established.
  • In another implementation, fingerprints may be randomly extracted as sample fingerprints from the metadata of the fingerprints stored in container 3 and container 4.
  • In the embodiment of the present invention, fingerprints are extracted at a certain interval from the metadata of the fingerprints stored in container 3 and container 4 as sample fingerprints, yielding the index of fingerprints shown in FIG. 3.
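Two of the sampling strategies can be sketched as follows. The interval of 4 used in the example is an assumption chosen because it reproduces the sample fingerprints that the text lists for FIG. 3; the function names and the list-of-tuples metadata layout are hypothetical.

```python
def sample_by_interval(metadata, interval):
    """metadata: list of (meta_address, fingerprint) in storage order.

    Keeps every interval-th entry, starting with the first, as a sample
    fingerprint and maps it to the address identifier of its metadata."""
    return {fp: addr for addr, fp in metadata[::interval]}


def sample_by_remainder(metadata, modulus, remainder):
    """Keeps fingerprints (as integers) that leave the given remainder."""
    return {fp: addr for addr, fp in metadata if fp % modulus == remainder}


# Fingerprint metadata of data stream 1 (container 3) and data stream 2 (container 4).
container3 = [(200 + i, f"FP{i}") for i in range(1, 17)]
container4 = [(300 + i, f"FP{29 + i}") for i in range(1, 17)]
fingerprint_index = {**sample_by_interval(container3, 4),
                     **sample_by_interval(container4, 4)}
```

Only the sampled fingerprints are held in the index; the full metadata stays in the containers and is loaded on a hit.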
  • the storage device loads the index of the fingerprint as shown in FIG. 3 to perform a fingerprint query in the deduplication operation.
  • the storage device receives the data stream 3. As shown in FIG. 4, the data stream 3 is divided into data blocks and the fingerprint of the data block is calculated.
  • The data blocks with logical addresses LA61 to LA71 are the same as the data blocks with logical addresses LA1 to LA11, respectively: the data block at LA61 is the same as the data block at LA1, LA62 the same as LA2, LA63 the same as LA3, LA64 the same as LA4, LA65 the same as LA5, LA66 the same as LA6, LA67 the same as LA7, LA68 the same as LA8, LA69 the same as LA9, LA70 the same as LA10, and LA71 the same as LA11.
  • The data blocks with logical addresses LA1-LA11 and the data blocks with logical addresses LA61-LA71 have the same data block sequence positions: the data block with logical address LA1 has the same sequence position as the data block with logical address LA61, the data block with logical address LA2 has the same sequence position as the data block with logical address LA62, and so on, up to the data block with logical address LA11 and the data block with logical address LA71.
  • The fingerprints of the data blocks with logical addresses LA61-LA71 are FP1-FP11, in order.
  • The storage device loads the index of fingerprints shown in FIG. 3 and searches it for fingerprints identical to those of the data blocks of data stream 3.
  • The index of fingerprints includes the fingerprints FP1, FP5, FP9, FP13, FP30, FP34, FP38, and FP42.
  • The storage device determines that the fingerprints in the index that are the same as those of the data blocks of data stream 3 are FP1, FP5, FP9, and FP13.
  • According to the address identifiers of the metadata of the fingerprints corresponding to FP1, FP5, FP9, and FP13 in the index, the storage device loads the metadata of the fingerprints: FP1 and PA1, FP5 and PA5, FP9 and PA9, and FP13 and PA13.
  • The storage device searches the metadata of the fingerprints, determines that the data blocks with logical addresses LA61-LA71 are duplicate data blocks, and no longer stores the data blocks with logical addresses LA61-LA71.
  • The physical addresses of the unique data blocks corresponding to the data blocks with logical addresses LA61-LA71 are PA1-PA11, and the storage device establishes a mapping between LA61 and an aggregation address.
  • In one implementation, establishing the mapping between LA61 and the aggregation address includes establishing a mapping from LA61 to PA1+11 and PA201+11, expressed as LA61--->PA1+11 and PA201+11. For the specific representation, refer to the implementation described above.
  • In another implementation, establishing the mapping between LA61 and the aggregation address includes establishing a mapping between LA61 and the address identifier of the metadata of the aggregated fingerprint, and a mapping between the address identifier of the metadata of the aggregated fingerprint and the physical address of the aggregated data block, specifically a mapping from LA61 to PA201+11 and a mapping from PA201+11 to PA1+11, expressed as LA61--->PA201+11, PA201+11--->PA1+11. For the specific representation, refer to the implementation described above.
  • In another implementation, establishing the mapping between LA61 and the aggregation address includes establishing a mapping between LA61 and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint, expressed as LA61--->PA1+11, PA1+11--->PA201+11. For the specific representation, refer to the implementation described above; details are not described herein again.
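How a run of duplicate blocks ends up as one aggregated mapping onto already stored data can be sketched as follows. This is a simplified model: it assumes the relevant fingerprint metadata has already been loaded into a dictionary, and the function name `dedupe_run` and the tuple layout are hypothetical.

```python
def dedupe_run(first_la, fingerprints, fp_metadata):
    """Build one aggregated mapping for the leading run of duplicate blocks.

    fingerprints: fingerprints of the incoming blocks in logical address order.
    fp_metadata: dict fingerprint -> (meta_address, data_pa), already loaded.
    Returns (mapping, run_length); mapping is None if the first block is new.
    The run stops at the first new block or at the first duplicate whose
    stored block and metadata are not contiguous with the run so far."""
    first_hit = None
    run = 0
    for fp in fingerprints:
        hit = fp_metadata.get(fp)
        if hit is None:
            break  # first non-duplicate block ends the run
        if first_hit is None:
            first_hit = hit
        elif hit != (first_hit[0] + run, first_hit[1] + run):
            break  # duplicate, but not contiguous with the run
        run += 1
    if run == 0:
        return None, 0
    meta_addr, data_pa = first_hit
    return {first_la: ((data_pa, run), (meta_addr, run))}, run
```

Applied to data stream 3, the first 11 fingerprints match FP1-FP11, so the result corresponds to LA61--->PA1+11 and PA201+11, and no block data is written for LA61-LA71.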
  • The data blocks with logical addresses LA72-LA76 are non-duplicate data blocks.
  • To maintain the locality of the data blocks, they are stored in container 5 in the order of the logical addresses LA72-LA76 at consecutive physical addresses (in the embodiment of the present invention, the first physical address of container 5 is PA401), denoted as physical addresses PA401-PA405.
  • The metadata of the fingerprints of the data blocks with logical addresses LA72-LA76 is stored in container 6 at consecutive physical addresses in the order of the logical addresses LA72-LA76 (in the embodiment of the present invention, the first physical address of container 6 is PA501): FP72 and PA401 are stored at PA501, FP73 and PA402 at PA502, FP74 and PA403 at PA503, FP75 and PA404 at PA504, and FP76 and PA405 at PA505, denoted as physical addresses PA501-PA505.
  • The storage device establishes a mapping from the address identifier of the metadata of each fingerprint to that metadata: PA501 maps to FP72 and PA401, PA502 to FP73 and PA402, PA503 to FP74 and PA403, PA504 to FP75 and PA404, and PA505 to FP76 and PA405.
  • Because the data blocks with logical addresses LA72-LA76 and the metadata of their fingerprints are each stored contiguously at physical addresses, the storage device establishes a mapping between LA72 and an aggregation address.
  • In one implementation, establishing the mapping between LA72 and the aggregation address includes establishing a mapping from LA72 to PA401+5 and PA501+5. For the specific representation, refer to the implementation described above.
  • In another implementation, establishing the mapping between LA72 and the aggregation address includes establishing a mapping between LA72 and the address identifier of the metadata of the aggregated fingerprint, and a mapping between the address identifier of the metadata of the aggregated fingerprint and the physical address of the aggregated data block, specifically a mapping from LA72 to PA501+5 and a mapping from PA501+5 to PA401+5, expressed as LA72--->PA501+5, PA501+5--->PA401+5. For the specific representation, refer to the implementation described above.
  • In another implementation, establishing the mapping between LA72 and the aggregation address includes establishing a mapping between LA72 and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • For the specific representation, refer to the implementation described above; details are not described herein again.
  • The storage device samples the fingerprints of the non-duplicate data blocks with logical addresses LA72-LA76 and adds them to the index of fingerprints.
  • For example, fingerprints are extracted from the metadata of the fingerprints stored in container 6 as sample fingerprints, and a new index of fingerprints is obtained on the basis of FIG. 3, as shown in FIG. 5.
  • Take as an example the case in which the storage device has established the mapping LA1--->PA1+16 and PA201+16.
  • When the storage device receives a data read request carrying the logical address LA2, the storage device queries the mapping LA1--->PA1+16 and PA201+16 and determines that LA2 differs from LA1 by one logical address. The storage device then reads the data at the physical address that is offset from the physical address corresponding to LA1 by one logical address.
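The read path can be sketched as an offset calculation within the aggregated run that covers the requested logical address. The function name `resolve_read` and the dictionary layout are hypothetical, and lengths are counted in blocks; sorting the keys on each call keeps the sketch short and would be replaced by an ordered structure in practice.

```python
import bisect


def resolve_read(mapping, la):
    """mapping: dict first_la -> (first_data_pa, length) for aggregated runs.

    Returns the physical address for logical address la, or None if no
    aggregated run covers la."""
    starts = sorted(mapping)
    i = bisect.bisect_right(starts, la) - 1  # last run starting at or before la
    if i < 0:
        return None
    first_la = starts[i]
    first_pa, length = mapping[first_la]
    offset = la - first_la
    return first_pa + offset if offset < length else None
```

For the mapping LA1--->PA1+16, a read of LA2 has offset 1 and therefore resolves to PA2.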
  • The storage device compresses the stored unique data blocks using a compression algorithm.
  • When compressing unique data blocks, the storage device sets a compression window, which is the length of data blocks that can be compressed at one time. Therefore, in the implementation of the present invention, the physical address length of the aggregated data block does not exceed the compression window.
  • Before establishing the mapping, the storage device queries its compression window to determine that the physical address length of the aggregated data block does not exceed the compression window.
  • For example, for the data blocks with logical addresses LA1-LA16, the mapping LA1--->PA1+16 and PA201+16 can be established if the physical address length of the aggregated data block does not exceed the compression window; if it does exceed the compression window, multiple mappings are established, such as LA1--->PA1+8 and PA201+8, and LA9--->PA9+8 and PA209+8.
  • The storage device compresses the stored non-duplicate data according to the compression window.
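Limiting each aggregated data block to the compression window can be sketched as a simple split of one run into several mappings. Expressing the window as a number of blocks is an assumption made for illustration, and the function name is hypothetical.

```python
def split_by_window(first_la, first_pa, first_meta, count, window):
    """Split a run of count blocks into aggregated mappings of at most window blocks.

    Returns dict: first LA of each piece -> ((data PA, n), (metadata address, n))."""
    mappings = {}
    for off in range(0, count, window):
        n = min(window, count - off)
        mappings[first_la + off] = ((first_pa + off, n), (first_meta + off, n))
    return mappings
```

With a window of 16 blocks the 16 blocks of data stream 1 stay in one mapping; with a window of 8 blocks they split into the two mappings starting at LA1 and LA9.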
  • The non-duplicate data stored in container 5 and the metadata of the fingerprints stored in container 6 may also be stored in the same container, for example container 5, which is not limited by the embodiment of the present invention.
  • For the data blocks with logical addresses LA1-LA16, the mapping LA1--->PA1+16 and PA201+16 may be established, where the mapping includes a mapping address direction identifier, indicating that addresses are resolved in order of increasing logical address starting from LA1.
  • The direction may be embodied in the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint: for example, the physical address of the aggregated data block increments from PA1 and the address identifier of the metadata of the aggregated fingerprint increments from PA201.
  • Alternatively, the mapping LA16--->PA16-16 and PA216-16 may be established, which likewise reduces the number of mappings and saves memory space of the storage device, where the mapping includes the mapping address direction.
  • In this case, the physical address of the aggregated data block decrements from PA16 and the address identifier of the metadata of the aggregated fingerprint decrements from PA216. Details are not described again in the embodiments of the present invention.
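The effect of the direction identifier on address resolution can be sketched with a signed step. Encoding the direction as +1 or -1 is an assumption made for illustration; the text does not say how the identifier is stored, and the function name is hypothetical.

```python
def resolve_directed(key_la, first_pa, length, direction, la):
    """Resolve la within a run keyed at key_la.

    direction is +1 when addresses increase from key_la (LA1--->PA1+16)
    and -1 when they decrease from key_la (LA16--->PA16-16).
    Returns the physical address, or None if la is outside the run."""
    offset = (la - key_la) * direction
    if not 0 <= offset < length:
        return None
    return first_pa + direction * offset
```

Both forms describe the same 16 blocks: the incrementing mapping resolves LA2 to PA2 from the key LA1, and the decrementing mapping resolves LA15 to PA15 from the key LA16.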
  • In the embodiment of the present invention, a fixed-length blocking algorithm is used as an example to divide a data stream into data blocks.
  • A variable-length blocking algorithm, such as a Content-Defined Chunking (CDC) algorithm, may also be used to divide the data stream into data blocks.
  • The storage device in the embodiment of the present invention can implement a deduplication operation in a file system, for example in network attached storage (NAS), in which case the logical address in the embodiment of the present invention is a file identifier plus an offset address.
  • The storage device in the embodiment of the present invention may also implement a deduplication operation on block data, for example in a storage area network (SAN), in which case the logical address in the embodiment of the present invention is a logical block address (LBA).
  • The address identifier of the metadata of a fingerprint in the embodiment of the present invention may also be a logical identifier that uniquely identifies the metadata of the fingerprint. The storage device may assign a globally unique identifier to the metadata of the fingerprint corresponding to each unique data block, such that the address identifiers of the metadata of the fingerprints of multiple unique data blocks with consecutive logical addresses increase linearly.
  • For example, the address identifier of the metadata of the fingerprint of each data block with logical addresses LA1-LA16 may be a chunk identifier (Chunk ID).
  • The storage device performs a deduplication operation on a data stream to determine the unique data blocks with consecutive logical addresses in the data stream, and stores the metadata of the fingerprints of these unique data blocks at physical addresses of a container in the logical address order of the unique data blocks. The storage device generates globally unique Chunk IDs for the metadata of the fingerprints of the unique data blocks in the logical address order of the unique data blocks, so that the Chunk IDs increase linearly in the order of the logical addresses of these unique data blocks.
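Assigning linearly increasing Chunk IDs to a run of unique blocks can be sketched with a simple allocator. The class name `ChunkIdAllocator` and the starting value are hypothetical; the text requires only that the identifiers be globally unique and increase linearly in logical address order.

```python
class ChunkIdAllocator:
    """Hands out globally unique, linearly increasing Chunk IDs (illustrative)."""

    def __init__(self, start: int = 0):
        self._next = start

    def allocate_run(self, n: int) -> range:
        """Reserve n consecutive IDs for n unique blocks in logical address order."""
        if n < 1:
            raise ValueError("a run must contain at least one block")
        ids = range(self._next, self._next + n)
        self._next += n
        return ids
```

Because a run receives consecutive IDs, the aggregated form "first ID + count" still identifies the metadata of every block in the run, exactly as PA201+16 does when physical addresses are used.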
  • In one implementation, the unique data blocks of the same data stream and the metadata of the fingerprints of those unique data blocks are stored in different containers; in another implementation, they may be stored in different storage areas of the same container.
  • In the embodiment of the present invention, containers are used to store the unique data blocks and the metadata of the fingerprints.
  • A tree structure may also be used to store the unique data blocks and the metadata of the fingerprints.
  • For example, the leaf nodes of the tree may be used to store the unique data blocks and the metadata of the fingerprints.
  • The embodiment of the present invention may also establish the mapping from a logical address to an aggregation address only for duplicate data blocks whose logical addresses are consecutive, and establish one-to-one mappings according to the existing implementation for non-duplicate data blocks whose logical addresses are consecutive.
  • an embodiment of the present invention provides a storage device 600, including a receiving unit 601, a dividing unit 602, a calculating unit 603, a storage unit 604, and an establishing unit 605.
  • the receiving unit 601 is configured to receive the first data stream;
  • the dividing unit 602 is configured to divide the first data stream to obtain n data blocks;
  • the logical addresses of the n data blocks are consecutive;
  • the n data blocks include a first data block, and the logical address of the first data block is the first address among the logical addresses corresponding to the n data blocks;
  • n is an integer not less than 2;
  • the calculating unit 603 is configured to calculate the n data blocks to obtain a fingerprint of the n data blocks;
  • the storage unit 604 is configured to: when no fingerprint identical to the fingerprints of the n data blocks is found in the storage device 600, store the n data blocks contiguously in the first storage area in the order of the logical addresses of the n data blocks, and the metadata of the fingerprints of ...
  • The storage device thereby maintains fewer mappings, which saves memory space of the storage device, and it can determine from the mapping relationship whether the metadata of a fingerprint needs to be deleted.
  • the first storage area and the second storage area in the storage device 600 are containers. Further, the first storage area and the second storage area may be the same storage area.
  • In one implementation, the establishing unit 605 is specifically configured to establish a mapping from the logical address of the first data block to the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • In another implementation, the establishing unit 605 is specifically configured to establish a mapping between the logical address of the first data block and the address identifier of the metadata of the aggregated fingerprint, and a mapping between the address identifier of the metadata of the aggregated fingerprint and the physical address of the aggregated data block.
  • In another implementation, the establishing unit 605 is specifically configured to establish a mapping between the logical address of the first data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the metadata of the aggregated fingerprint.
  • The storage device 600 further includes a determining unit, configured to determine, before the mapping between the logical address of the first data block and the aggregation address is established, that the physical address length of the n data blocks stored in the first storage area does not exceed the compression window of the storage device.
  • the storage device 600 further includes a compression unit, configured to compress the n data blocks stored in the first storage area according to the compression window.
  • the receiving unit 601 is further configured to receive a second data stream; the dividing unit 602 is further configured to divide the second data stream into n data blocks; the logical addresses of the n data blocks of the second data stream are contiguous; the n data blocks of the second data stream include a second data block, and the logical address of the second data block is the first address among the logical addresses of the n data blocks of the second data stream.
  • the establishing unit 605 is further configured to establish an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint.
  • for the specific functions and implementation of the storage device 600, refer to the methods and steps described in the preceding embodiments.
  • the above units can be loaded into the memory of the storage device 600, and the CPU in the storage device 600 executes the instructions in the memory.
  • the above units are also referred to as structural units.
  • the embodiments of the present invention further provide a non-volatile computer-readable storage medium and a computer program product: when the computer instructions contained in them are loaded into the memory of the CPU of the storage device 600 shown in FIG. 6, the CPU executes the loaded instructions to implement the corresponding functions of the embodiments of the invention.
  • the disclosed apparatus and method may be implemented in other manners.
  • the division into units described in the device embodiments above is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

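Disclosed herein is a scheme for a storage device to perform deduplication. In this scheme, based on the locality principle of duplicate data, non-duplicate data blocks with contiguous logical addresses are stored at contiguous physical addresses in logical-address order, and the fingerprints of those data blocks are likewise stored at contiguous physical addresses in logical-address order. A mapping is also established from one logical address among the non-duplicate data blocks with contiguous logical addresses to an aggregated address.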

Description

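Deduplication method and storage device
Technical Field
The present invention relates to the field of information technology, and in particular to a deduplication method and a storage device.
Background
With the development of information technology, the amount of data to be stored is growing rapidly. Deduplication technology was introduced to ease the conflict between unbounded data growth and relatively limited storage space.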
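In a specific implementation, deduplication mainly includes the following steps:
Step 1: The storage device divides a data stream into data blocks, using either a fixed-length or a variable-length chunking algorithm.
Step 2: The storage device computes the fingerprint of each data block; a fingerprint is also called a feature value.
Step 3: The storage device compares the fingerprint of a data block with the fingerprints of the unique data blocks (also called non-duplicate data blocks) that it has already stored. If the fingerprint is identical to that of a stored unique data block, step 4 is performed; if it differs from the fingerprints of all stored unique data blocks, step 5 is performed.
Step 4: The storage device does not store the data block again. It increments by 1 the reference count of the stored data block that has the same fingerprint, and performs step 6.
Step 5: The storage device stores the data block as a unique data block, in the order of the logical addresses (LA) of the data blocks, at physical addresses (PA) of a data container of the storage device. It stores the metadata of the fingerprint of the data block, in the order of the logical addresses of the data blocks, at physical addresses of a fingerprint container of the storage device, generates an address identifier for the fingerprint metadata, establishes a mapping between the address identifier and the fingerprint metadata, and performs step 6. The metadata of the fingerprint of a data block includes the fingerprint and the physical address at which the data block is stored. The address identifier of the fingerprint metadata may be the physical address at which the metadata itself is stored. In another implementation, the address identifier may be a logical identifier that uniquely identifies the fingerprint metadata: the storage device may assign a globally unique identifier to the fingerprint metadata of each unique data block, with the identifiers of unique data blocks that have contiguous logical addresses increasing linearly. The mapping between the address identifier and the fingerprint metadata is established so that the fingerprint metadata can be loaded for fingerprint lookup in later deduplication operations.
Step 6: The storage device establishes a mapping between the logical address of the data block and the fingerprint, and a mapping between the fingerprint and the physical address at which the unique data block is stored. In a storage device with deduplication, a stored unique data block must be reachable through its logical address, and when a stored unique data block is deleted, the fingerprint corresponding to it must also be identified and deleted. Therefore, the mappings among the logical address of a data block, the fingerprint, and the physical address of the unique data block corresponding to the fingerprint are all indispensable.
However, although continuous deduplication of stored data saves physical space in the storage device, step 6 creates a large number of mapping relationships, which heavily consumes the memory of the storage device.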
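The memory cost described above can be made concrete with a small sketch of the conventional per-block flow (steps 2 to 6). This is an illustration under stated assumptions, not the patent's implementation: the class name `ConventionalStore`, the use of SHA-256 as the fingerprint, the initial reference count of 1, and the use of a list index as a stand-in for a physical address are all hypothetical.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-256 stands in for the fingerprint function.
    return hashlib.sha256(block).hexdigest()


class ConventionalStore:
    """Per-block deduplication with one LA->FP and one FP->PA entry per block."""

    def __init__(self):
        self.la_to_fp = {}   # logical address -> fingerprint (step 6)
        self.fp_to_pa = {}   # fingerprint -> physical address (step 6)
        self.refcount = {}   # fingerprint -> reference count (step 4)
        self.container = []  # data container; the list index plays the role of a PA

    def write(self, la: int, block: bytes) -> None:
        fp = fingerprint(block)                          # step 2
        if fp in self.fp_to_pa:                          # step 3: duplicate found
            self.refcount[fp] += 1                       # step 4
        else:                                            # step 5: unique block
            self.container.append(block)
            self.fp_to_pa[fp] = len(self.container) - 1
            self.refcount[fp] = 1                        # assumed starting count
        self.la_to_fp[la] = fp                           # step 6

    def mapping_entries(self) -> int:
        return len(self.la_to_fp) + len(self.fp_to_pa)


store = ConventionalStore()
for la in range(1, 17):
    store.write(la, bytes([la]) * 8)   # 16 distinct blocks
print(store.mapping_entries())
```

Writing 16 unique blocks creates 16 logical-address entries and 16 fingerprint entries, which is the growth in mapping state that the rest of the document sets out to reduce.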
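Summary
According to a first aspect, the present invention provides a deduplication method, including:
A storage device receives a first data stream;
The storage device divides the first data stream into n data blocks; the logical addresses of the n data blocks are contiguous; the n data blocks include a first data block whose logical address is the first address among the logical addresses of the n data blocks; n is an integer not less than 2;
The storage device computes the fingerprints of the n data blocks;
When the storage device finds no fingerprint identical to any of the fingerprints of the n data blocks, the storage device stores the n data blocks contiguously in a first storage area in the order of their logical addresses; the physical address at which the first data block is stored in the first storage area is a first physical address;
The storage device stores the fingerprint metadata of the n data blocks contiguously in a second storage area in the order of the logical addresses of the n data blocks; the metadata of any one of the fingerprints of the n data blocks includes that fingerprint and the physical address in the second storage area at which that fingerprint is stored;
The storage device establishes, for each of the fingerprints of the n data blocks, a mapping between the address identifier of the fingerprint's metadata and the metadata;
The storage device establishes a mapping between the logical address of the first data block and an aggregated address, where the aggregated address includes the physical address of an aggregated data block and the address identifier of aggregated fingerprint metadata; the physical address of the aggregated data block includes the first physical address and the physical address length occupied by the n data blocks in the first storage area; the address identifier of the aggregated fingerprint metadata includes the address identifier of the metadata of the first data block's fingerprint and the number of address identifiers of the fingerprint metadata of the n data blocks. In the embodiments of the present invention, the storage device reduces the number of mappings, which saves memory in the storage device, while the mapping relationship can still be used to determine whether fingerprint metadata needs to be deleted. Optionally, the first storage area and the second storage area are containers. Further, the first storage area and the second storage area may be the same storage area.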
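With reference to the first aspect, in a first possible implementation, the storage device establishing the mapping between the logical address of the first data block and the aggregated address specifically includes:
The storage device establishes a mapping between the logical address of the first data block and both the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
With reference to the first aspect, in a second possible implementation, the storage device establishing the mapping between the logical address of the first data block and the aggregated address specifically includes:
The storage device establishes a mapping between the logical address of the first data block and the address identifier of the aggregated fingerprint metadata, and a mapping between the address identifier of the aggregated fingerprint metadata and the physical address of the aggregated data block.
With reference to the first aspect, in a third possible implementation, the storage device establishing the mapping between the logical address of the first data block and the aggregated address specifically includes:
The storage device establishes a mapping between the logical address of the first data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
With reference to the first aspect or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation, the method further includes:
Before the storage device establishes the mapping between the logical address of the first data block and the aggregated address, the storage device determines that the physical address length occupied by the n data blocks stored in the first storage area does not exceed the compression window of the storage device.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, the method further includes: the storage device compresses the n data blocks stored in the first storage area according to the compression window.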
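With reference to the first aspect or any one of the first to third possible implementations of the first aspect, in a sixth possible implementation, the method further includes:
The storage device receives a second data stream;
The storage device divides the second data stream into n data blocks; the logical addresses of the n data blocks of the second data stream are contiguous; the n data blocks of the second data stream include a second data block whose logical address is the first address among the logical addresses of the n data blocks of the second data stream;
The storage device computes the fingerprints of the n data blocks of the second data stream;
When the storage device, by querying the fingerprint metadata of the n data blocks of the first data stream, determines that the n data blocks of the second data stream have the same fingerprints as the n data blocks of the first data stream at the same data block sequence positions, the storage device establishes a mapping between the logical address of the second data block and the aggregated address; the data block sequence position is the relative position of each data block among the n data blocks in either the first data stream or the second data stream.
With reference to the first aspect, in a seventh possible implementation, the method further includes:
The storage device establishes an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint. Optionally, the first fingerprint is one whose remainder after division by a specific integer equals a specific value. Optionally, the first fingerprint is drawn, at random or at a fixed interval, from the fingerprint metadata stored in the second storage area.
In the various possible implementations of the first aspect, the logical address of the first data block may alternatively be the last address among the logical addresses of the n data blocks of the first data stream, and the logical address of the second data block may alternatively be the last address among the logical addresses of the n data blocks of the second data stream. Optionally, both the mapping between the logical address of the first data block and the aggregated address and the mapping between the logical address of the second data block and the aggregated address contain a mapping address direction identifier.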
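According to a second aspect, an embodiment of the present invention provides a deduplication method, including:
A storage device receives a first data stream;
The storage device divides the first data stream into n data blocks; the logical addresses of the n data blocks are contiguous; the n data blocks include a first data block whose logical address is the first address among the logical addresses of the n data blocks; n is an integer not less than 2;
The storage device computes the fingerprints of the n data blocks;
When the storage device finds no fingerprint identical to any of the fingerprints of the n data blocks, the storage device stores the n data blocks contiguously in a first storage area in the order of their logical addresses; the physical address at which the first data block is stored in the first storage area is a first physical address;
The storage device stores the fingerprint metadata of the n data blocks contiguously in a second storage area in the order of the logical addresses of the n data blocks; the metadata of any one of the fingerprints of the n data blocks includes that fingerprint and the physical address in the second storage area at which that fingerprint is stored;
The storage device establishes, for each of the fingerprints of the n data blocks, a mapping between the address identifier of the fingerprint's metadata and the metadata;
The storage device receives a second data stream;
The storage device divides the second data stream into n data blocks; the logical addresses of the n data blocks of the second data stream are contiguous; the n data blocks of the second data stream include a second data block whose logical address is the first address among the logical addresses of the n data blocks of the second data stream;
The storage device computes the fingerprints of the n data blocks of the second data stream;
When the storage device, by querying the fingerprint metadata of the n data blocks of the first data stream, determines that the n data blocks of the second data stream have the same fingerprints as the n data blocks of the first data stream at the same data block sequence positions, the storage device establishes a mapping between the logical address of the second data block and an aggregated address; the data block sequence position is the relative position of each data block among the n data blocks in either the first data stream or the second data stream; the aggregated address includes the physical address of an aggregated data block and the address identifier of aggregated fingerprint metadata; the physical address of the aggregated data block includes the first physical address and the physical address length occupied by the n data blocks in the first storage area; the address identifier of the aggregated fingerprint metadata includes the address identifier of the metadata of the first data block's fingerprint and the number of address identifiers of the fingerprint metadata of the n data blocks. Optionally, the first storage area and the second storage area are containers. Further, the first storage area and the second storage area may be the same storage area.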
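With reference to the second aspect, in a first possible implementation, the storage device establishing the mapping between the logical address of the second data block and the aggregated address specifically includes:
The storage device establishes a mapping between the logical address of the second data block and both the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
With reference to the second aspect, in a second possible implementation, the storage device establishing the mapping between the logical address of the second data block and the aggregated address specifically includes:
The storage device establishes a mapping between the logical address of the second data block and the address identifier of the aggregated fingerprint metadata, and a mapping between the address identifier of the aggregated fingerprint metadata and the physical address of the aggregated data block.
With reference to the second aspect, in a third possible implementation, the storage device establishing the mapping between the logical address of the second data block and the aggregated address specifically includes:
The storage device establishes a mapping between the logical address of the second data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation, the method further includes:
Before the storage device establishes the mapping between the logical address of the second data block and the aggregated address, the storage device determines that the physical address length occupied by the n data blocks stored in the first storage area does not exceed the compression window of the storage device.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, the method further includes: the storage device compresses the n data blocks stored in the first storage area according to the compression window.
With reference to the second aspect, in a sixth possible implementation, the method further includes: the storage device establishes an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint. Optionally, the first fingerprint is one whose remainder after division by a specific integer equals a specific value. Optionally, the first fingerprint is drawn, at random or at a fixed interval, from the fingerprint metadata stored in the second storage area.
In the various possible implementations of the second aspect, the logical address of the first data block may alternatively be the last address among the logical addresses of the n data blocks of the first data stream, and the logical address of the second data block may alternatively be the last address among the logical addresses of the n data blocks of the second data stream. Optionally, the mapping between the logical address of the second data block and the aggregated address contains a mapping address direction identifier.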
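Correspondingly, the present invention further provides storage devices that serve as the storage device in the various possible implementations of the first aspect and the second aspect, to carry out those implementations respectively. The storage device includes structural units that implement the various possible implementations of the first and second aspects; alternatively, the storage device includes an interface and a processor that carry out those implementations.
Correspondingly, the present invention further provides a non-volatile computer-readable storage medium and a computer program product. When the computer instructions contained in them are loaded into the memory of the storage device provided by the present invention and executed by the central processing unit (CPU) of the storage device, the storage device carries out the various possible implementations of the first aspect and the second aspect respectively.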
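Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a storage device according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of storing non-duplicate data and fingerprint metadata according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a fingerprint index according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of storing non-duplicate data and fingerprint metadata according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a fingerprint index according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a storage device according to an embodiment of the present invention.
Detailed Description of the Embodiments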
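As shown in FIG. 1, a storage device with deduplication includes a central processing unit (CPU) 101 and a memory 102. The CPU 101 executes computer instructions in the memory 102 to implement the deduplication operations described in the embodiments of the present invention. In addition, to save CPU computing resources, a field programmable gate array (FPGA) or other hardware may perform all of the deduplication operations, or the FPGA or other hardware and the CPU may each perform part of them. For ease of description, the embodiments uniformly describe the processor of the storage device as implementing the deduplication operations. The storage device further includes an interface, which receives data streams and communicates with the processor, and a persistent storage medium, which stores the unique data blocks remaining after deduplication, the fingerprint metadata, and so on.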
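When a storage device stores data, the same data block usually recurs in different data streams. A data stream represents one data source, for example a file or a single application. In practice, when performing deduplication, the storage device may divide a 1 MB file into several data blocks. If the file is partially modified, most of the data in the modified file is identical to the data in the original file and only a small part differs; moreover, the unchanged data blocks occupy essentially the same positions in the data block sequence as before. The embodiments call this property data block duplication locality. Consequently, when the storage device determines that one data block in a data stream is a duplicate, the data blocks adjacent to it are very likely duplicates as well. The storage device therefore receives a data stream, divides it into data blocks, computes their fingerprints, and queries whether an identical fingerprint is already stored. If not, the data block is a non-duplicate data block, and the storage device stores the data blocks of the stream that do not duplicate any stored unique data block contiguously, in the order of their logical addresses, at physical addresses of a specific area of the storage device. In the embodiments, this specific area may be a container, which holds the non-duplicate data blocks of one data stream contiguously on physical addresses in logical-address order. The storage device likewise stores the fingerprint metadata of these non-duplicate data blocks contiguously, in the order of the blocks' logical addresses, at physical addresses of a specific area. Storing fingerprint metadata this way exploits data block duplication locality: the fingerprint metadata of non-duplicate data blocks with contiguous logical addresses can be loaded into memory together, which raises the hit rate of fingerprint lookups during deduplication. The fingerprint metadata storage area may be part of the container that stores the non-duplicate data blocks of the stream, or a separate container. In the embodiments, data blocks have contiguous logical addresses when the end of one block's logical address is the start of another block's logical address. Similarly, physical addresses are contiguous when the end of the physical address storing one data block is the start of the physical address storing another. When the data blocks of a stream that do not duplicate any stored unique data block are stored contiguously, in logical-address order, at physical addresses of a specific area, the physical addresses storing those blocks are contiguous.
In the embodiments, the following expressions have the same meaning: storing data blocks contiguously in a storage area in the order of their logical addresses; storing data blocks in a storage area one after another in the order of their logical addresses; and storing data blocks at contiguous physical addresses of a storage area in the order of their logical addresses. In each case, data blocks with contiguous logical addresses are also contiguous on the physical addresses of that storage area.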
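As shown in FIG. 2, the storage device receives data stream 1 and data stream 2. Taking a fixed-length chunking algorithm as the example, the storage device cuts data stream 1 and data stream 2 into data blocks of fixed length. For ease of explanation, the embodiments assume that the data in both streams is written for the first time, so that every fixed-length data block cut from data stream 1 and data stream 2 is a unique block in the storage device.
Data stream 1 contains data blocks with contiguous logical addresses LA1-LA16, whose fingerprints are FP1-FP16 respectively. Data stream 2 contains data blocks with contiguous logical addresses LA30-LA45, whose fingerprints are FP30-FP45 respectively.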
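The storage device stores the data blocks of one data stream contiguously in the same container in logical-address order. It stores the data blocks of data stream 1 contiguously on the physical addresses of container 1 in the order LA1-LA16. For example, taking PA1 as the starting physical address of container 1, the data blocks at LA1-LA16 are stored in turn at PA1-PA16. The storage device stores the fingerprint metadata of the data blocks of data stream 1 (the fingerprint of each data block and the physical address at which that block is stored) contiguously on the physical addresses of container 3, in the logical-address order of the blocks in data stream 1: FP1 with PA1 is stored at PA201, FP2 with PA2 at PA202, FP3 with PA3 at PA203, FP4 with PA4 at PA204, FP5 with PA5 at PA205, FP6 with PA6 at PA206, FP7 with PA7 at PA207, FP8 with PA8 at PA208, FP9 with PA9 at PA209, FP10 with PA10 at PA210, FP11 with PA11 at PA211, FP12 with PA12 at PA212, FP13 with PA13 at PA213, FP14 with PA14 at PA214, FP15 with PA15 at PA215, and FP16 with PA16 at PA216.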
The storage device establishes a mapping from the address identifier of each fingerprint's metadata to that metadata: PA201 to FP1 with PA1, PA202 to FP2 with PA2, PA203 to FP3 with PA3, PA204 to FP4 with PA4, PA205 to FP5 with PA5, PA206 to FP6 with PA6, PA207 to FP7 with PA7, PA208 to FP8 with PA8, PA209 to FP9 with PA9, PA210 to FP10 with PA10, PA211 to FP11 with PA11, PA212 to FP12 with PA12, PA213 to FP13 with PA13, PA214 to FP14 with PA14, PA215 to FP15 with PA15, and PA216 to FP16 with PA16.
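Because the non-duplicate data blocks with logical addresses LA1-LA16 are stored contiguously at physical addresses PA1-PA16, and their fingerprint metadata is also stored contiguously at PA201-PA216, the storage device establishes a mapping between LA1 and an aggregated address. The aggregated address includes the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata. The address identifier of the aggregated fingerprint metadata includes the address identifier of the fingerprint metadata of the data block at LA1 and the number, 16, of address identifiers of the fingerprint metadata of the data blocks at LA1 to LA16. The physical address of the aggregated data block includes PA1, the physical address storing the data block at LA1, and the physical address length occupied in container 1 by the data blocks at LA1 to LA16. In the embodiments, the data blocks at LA1 to LA16 are also called an aggregated data block. Optionally, this physical address length may be expressed as an actual physical length. Optionally, it may be expressed as a number of physical blocks. For example, the physical address of the aggregated data block may be written PA1+16, meaning that the data blocks at LA1 to LA16 start at physical address PA1 and span 16 physical blocks. The address identifier of the aggregated fingerprint metadata is written PA201+16, meaning that the address identifier of the fingerprint metadata of the data block at LA1 is PA201 and that there are 16 address identifiers in total for the data blocks at LA1 to LA16. In one implementation, establishing the mapping between LA1 and the aggregated address specifically means that the storage device establishes a mapping between LA1 and both PA1+16 and PA201+16, written LA1 ---> PA1+16 and PA201+16, where PA1+16 and PA201+16 are stored in the same field. Specifically, a key-value form may be used, with key LA1 and value PA1+16 and PA201+16. With the prior-art approach, 32 mappings would have to be established for the data blocks of data stream 1: from LA1 to FP1, from FP1 to PA1, ..., from LA16 to FP16, and from FP16 to PA16. In the embodiments, only one mapping is needed. The storage device therefore reduces the number of mappings, which saves memory in the storage device, while the mapping relationship can still be used to determine whether fingerprint metadata needs to be deleted. Optionally, the storage device establishes a mapping from LA1 to aggregated address 1, which includes PA1+8 and PA201+8, and a mapping from LA9 to aggregated address 2, which includes PA9+8 and PA209+8. This also reduces the number of mappings in the storage device while still allowing the mapping relationship to determine whether fingerprint metadata needs to be deleted. The physical address length of an aggregated data block may be set by the specific implementation, and the present invention does not limit it.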
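As a rough illustration of the single key-value entry LA1 ---> PA1+16 and PA201+16, the sketch below models the aggregated address as a four-field tuple. The names `AggregatedAddress` and `aggregate`, and the use of plain integers for LA, PA and metadata address identifiers, are assumptions made for this example only.

```python
from typing import Dict, NamedTuple


class AggregatedAddress(NamedTuple):
    data_pa: int     # first physical address of the aggregated data block (PA1)
    data_len: int    # physical address length, here counted in physical blocks
    meta_id: int     # address identifier of the first block's fingerprint metadata (PA201)
    meta_count: int  # number of fingerprint-metadata address identifiers


def aggregate(first_la: int, first_pa: int, first_meta_id: int, n: int) -> Dict[int, AggregatedAddress]:
    """Build the single mapping that covers n contiguous unique blocks."""
    return {first_la: AggregatedAddress(first_pa, n, first_meta_id, n)}


# Data stream 1 from the example: LA1-LA16 stored at PA1-PA16, metadata at PA201-PA216.
mapping = aggregate(first_la=1, first_pa=1, first_meta_id=201, n=16)
per_block_entries = 2 * 16  # LA->FP plus FP->PA for each block in the prior-art layout
print(len(mapping), per_block_entries)
```

Both values of the key-value pair live in one record, which mirrors the statement that PA1+16 and PA201+16 are stored in the same field.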
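In another implementation, establishing the mapping between LA1 and the aggregated address specifically means that the storage device establishes a mapping between LA1 and the address identifier of the aggregated fingerprint metadata, and a mapping between the address identifier of the aggregated fingerprint metadata and the physical address of the aggregated data block. The address identifier of the aggregated fingerprint metadata includes the address identifier of the fingerprint metadata of the data block at LA1 and the number, 16, of address identifiers of the fingerprint metadata of the data blocks at LA1 to LA16. The physical address of the aggregated data block includes PA1, the physical address storing the data block at LA1, and the physical address length occupied in container 1 by the data blocks at LA1 to LA16. One representation is a mapping from LA1 to PA201+16 and a mapping from PA201+16 to PA1+16, written LA1 ---> PA201+16, PA201+16 ---> PA1+16; that is, key LA1 has value PA201+16, and key PA201+16 has value PA1+16. Optionally, the storage device establishes a mapping between LA1 and address identifier 3 of aggregated fingerprint metadata, and a mapping between address identifier 3 and physical address 3 of an aggregated data block, where address identifier 3 includes PA201+8 and physical address 3 includes PA1+8; the storage device also establishes a mapping between LA9 and address identifier 4 of aggregated fingerprint metadata, and a mapping between address identifier 4 and physical address 4 of an aggregated data block, where address identifier 4 includes PA209+8 and physical address 4 includes PA9+8.
In another implementation, establishing the mapping between LA1 and the aggregated address specifically means that the storage device establishes a mapping between LA1 and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata. The details are not repeated here. This may be written LA1 ---> PA1+16, PA1+16 ---> PA201+16; that is, key LA1 has value PA1+16, and key PA1+16 has value PA201+16.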
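The three mapping layouts described so far (one combined entry, a chain through the metadata identifier, and a chain through the data address) can be compared side by side. This is a minimal sketch under assumptions: address ranges are modelled as tagged tuples such as `("data", 1, 16)` for PA1+16, and all function names are hypothetical.

```python
DATA = ("data", 1, 16)    # stands for PA1+16
META = ("meta", 201, 16)  # stands for PA201+16


def combined(la, data, meta):
    # LA1 ---> PA1+16 and PA201+16, both values in one field
    return {la: (data, meta)}


def via_metadata(la, data, meta):
    # LA1 ---> PA201+16, PA201+16 ---> PA1+16
    return {la: meta}, {meta: data}


def via_data(la, data, meta):
    # LA1 ---> PA1+16, PA1+16 ---> PA201+16
    return {la: data}, {data: meta}


def resolve_via_metadata(tables, la):
    la_to_meta, meta_to_data = tables
    meta = la_to_meta[la]
    return meta_to_data[meta], meta


def resolve_via_data(tables, la):
    la_to_data, data_to_meta = tables
    data = la_to_data[la]
    return data, data_to_meta[data]


# All three layouts lead from LA1 to the same pair of ranges.
print(combined(1, DATA, META)[1])
print(resolve_via_metadata(via_metadata(1, DATA, META), 1))
print(resolve_via_data(via_data(1, DATA, META), 1))
```

The combined form needs one lookup per resolution, while the chained forms need two but keep each table's values homogeneous; which trade-off suits a given device is left open by the text.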
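The storage device stores the data blocks of data stream 2 contiguously on the physical addresses of container 2 in the order LA30-LA45. For example, taking PA101 as the starting physical address of container 2, the data blocks at LA30-LA45 are stored in turn at PA101-PA116.
The storage device stores the fingerprint metadata of the data blocks of data stream 2 contiguously on the physical addresses of container 4, in the order LA30-LA45 of the blocks in data stream 2: FP30 with PA101 is stored at PA301, FP31 with PA102 at PA302, FP32 with PA103 at PA303, FP33 with PA104 at PA304, FP34 with PA105 at PA305, FP35 with PA106 at PA306, FP36 with PA107 at PA307, FP37 with PA108 at PA308, FP38 with PA109 at PA309, FP39 with PA110 at PA310, FP40 with PA111 at PA311, FP41 with PA112 at PA312, FP42 with PA113 at PA313, FP43 with PA114 at PA314, FP44 with PA115 at PA315, and FP45 with PA116 at PA316. The storage device establishes a mapping from the address identifier of each fingerprint's metadata to that metadata: PA301 to FP30 with PA101, PA302 to FP31 with PA102, PA303 to FP32 with PA103, PA304 to FP33 with PA104, PA305 to FP34 with PA105, PA306 to FP35 with PA106, PA307 to FP36 with PA107, PA308 to FP37 with PA108, PA309 to FP38 with PA109, PA310 to FP39 with PA110, PA311 to FP40 with PA111, PA312 to FP41 with PA112, PA313 to FP42 with PA113, PA314 to FP43 with PA114, PA315 to FP44 with PA115, and PA316 to FP45 with PA116.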
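Because the non-duplicate data blocks with logical addresses LA30-LA45 are stored contiguously at physical addresses PA101-PA116, and their fingerprint metadata is also stored contiguously at PA301-PA316, the storage device establishes a mapping between LA30 and an aggregated address. The aggregated address includes the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata. The address identifier of the aggregated fingerprint metadata includes the address identifier of the fingerprint metadata of the data block at LA30 and the number of address identifiers of the fingerprint metadata of the data blocks at LA30 to LA45. The physical address of the aggregated data block includes PA101, the physical address storing the data block at LA30, and the physical address length occupied in container 2 by the data blocks at LA30 to LA45. In the embodiments, the data blocks at LA30 to LA45 are also called an aggregated data block. Optionally, the length from PA101 to PA116 may be expressed as an actual physical length. Optionally, it may be expressed as the number of physical blocks, 16. For example, the physical address of the aggregated data block may be written PA101+16 and the address identifier of the aggregated fingerprint metadata PA301+16. In one implementation, establishing the mapping between LA30 and the aggregated address specifically means that the storage device establishes a mapping between LA30 and both PA101+16 and PA301+16, written LA30 ---> PA101+16 and PA301+16; for the representation, refer to the implementation described earlier. With the prior-art approach, the storage device would have to establish 32 mappings for the data blocks of data stream 2: from LA30 to FP30, from FP30 to PA101, ..., from LA45 to FP45, and from FP45 to PA116. In the embodiments, only one mapping is needed. The storage device therefore reduces the number of mappings, which saves memory in the storage device, while the mapping relationship can still be used to determine whether fingerprint metadata needs to be deleted. Optionally, the storage device establishes a mapping from LA30 to aggregated address 5, which includes PA101+8 and PA301+8, and a mapping from LA38 to aggregated address 6, which includes PA109+8 and PA309+8. This also reduces the number of mappings in the storage device. The physical address length of an aggregated data block may be set by the specific implementation, and the present invention does not limit it.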
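In another implementation, establishing the mapping between LA30 and the aggregated address specifically means that the storage device establishes a mapping between LA30 and the address identifier of the aggregated fingerprint metadata, and a mapping between the address identifier of the aggregated fingerprint metadata and the physical address of the aggregated data block. The address identifier of the aggregated fingerprint metadata includes the address identifier of the fingerprint metadata of the data block at LA30 and the number of address identifiers of the fingerprint metadata of the data blocks at LA30 to LA45. The physical address of the aggregated data block includes PA101, the physical address storing the data block at LA30, and the physical address length occupied in container 2 by the data blocks at LA30 to LA45. One representation is a mapping from LA30 to PA301+16 and a mapping from PA301+16 to PA101+16, written LA30 ---> PA301+16, PA301+16 ---> PA101+16; for the representation, refer to the implementation described earlier. Optionally, the storage device establishes a mapping between LA30 and address identifier 7 of aggregated fingerprint metadata, and a mapping between address identifier 7 and physical address 7 of an aggregated data block, where address identifier 7 includes PA301+8 and physical address 7 includes PA101+8; the storage device also establishes a mapping between LA38 and address identifier 8 of aggregated fingerprint metadata, and a mapping between address identifier 8 and physical address 8 of an aggregated data block, where address identifier 8 includes PA309+8 and physical address 8 includes PA109+8. This also reduces the number of mappings in the storage device. The length of an aggregated address may be set by the specific implementation, and the present invention does not limit it.
In another implementation, establishing the mapping between LA30 and the aggregated address specifically means that the storage device establishes a mapping between LA30 and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata, written LA30 ---> PA101+16, PA101+16 ---> PA301+16. For the representation, refer to the implementation described earlier; the details are not repeated here.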
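Further, the storage device builds a fingerprint index to speed up fingerprint lookup in subsequent deduplication operations and to reduce the amount of fingerprint metadata it caches. Specifically, among the fingerprints in the fingerprint metadata of the data blocks of data streams 1 and 2, those whose remainder after division by a specific integer equals a specific value may serve as the representative fingerprints (also called sampled fingerprints) in the index. For example, fingerprints whose remainder after division by 10 is 3 are taken as sampled fingerprints, and a mapping is established between each such fingerprint and the address identifier of its metadata. In another implementation, sampled fingerprints may be drawn at random or at a fixed interval from the fingerprint metadata stored in containers 3 and 4. The embodiments take fixed-interval sampling from the fingerprint metadata stored in containers 3 and 4 as the example, which yields the fingerprint index shown in FIG. 3.
The storage device loads the fingerprint index shown in FIG. 3 for fingerprint lookup during deduplication.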
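The two sampling rules mentioned above (remainder-based and fixed-interval) can be sketched as follows. The example treats fingerprints as small integers so that FP1-FP16 become 1-16, and models each metadata entry as a `(fingerprint, metadata_address_id)` pair; these representations and the function names are assumptions for illustration.

```python
def sample_by_remainder(metadata, modulus=10, remainder=3):
    """Keep fingerprints whose remainder after division by `modulus` equals `remainder`."""
    return {fp: meta_id for fp, meta_id in metadata if fp % modulus == remainder}


def sample_by_interval(metadata, interval=4):
    """Keep every `interval`-th fingerprint in stored (logical-address) order."""
    return {fp: meta_id for i, (fp, meta_id) in enumerate(metadata) if i % interval == 0}


# Container 3 from the example: FP1-FP16 with metadata stored at PA201-PA216.
container3 = [(i, 200 + i) for i in range(1, 17)]

print(sample_by_remainder(container3))  # fingerprints 3 and 13
print(sample_by_interval(container3))   # fingerprints 1, 5, 9 and 13
```

With an interval of 4, the sampled fingerprints for data stream 1 are FP1, FP5, FP9 and FP13, which agrees with the index contents listed later in the description. The interval value itself is not stated in the text and is chosen here to match that list.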
Further, the storage device receives data stream 3, as shown in FIG. 4, divides it into data blocks, and computes their fingerprints. The data block at LA61 is identical to the data block at LA1, the block at LA62 to the block at LA2, LA63 to LA3, LA64 to LA4, LA65 to LA5, LA66 to LA6, LA67 to LA7, LA68 to LA8, LA69 to LA9, LA70 to LA10, and LA71 to LA11. Among the data blocks at LA1-LA11 and the data blocks at LA61-LA71, the block at LA1 and the block at LA61 are said to have the same data block sequence position, the block at LA2 and the block at LA62 have the same data block sequence position, and so on, up to the block at LA11 and the block at LA71. The fingerprints of the data blocks at LA61-LA71 are FP1-FP11 in order.
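The storage device loads the fingerprint index shown in FIG. 3 and checks whether the index contains fingerprints identical to those of the data blocks of data stream 3. In the embodiments, the fingerprint index contains FP1, FP5, FP9, FP13, FP30, FP34, FP38 and FP42. The storage device determines that the fingerprints in the index that match data blocks of data stream 3 are FP1, FP5, FP9 and FP13. Based on data block duplication locality, the storage device uses the metadata address identifiers that the index records for FP1, FP5, FP9 and FP13 to load the fingerprint metadata FP1 with PA1, FP5 with PA5, FP9 with PA9 and FP13 with PA13, and at the same time loads FP2 with PA2, FP3 with PA3, FP4 with PA4, FP6 with PA6, FP7 with PA7, FP8 with PA8, FP10 with PA10, FP11 with PA11, FP12 with PA12, FP14 with PA14, FP15 with PA15 and FP16 with PA16.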
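The locality-driven loading described above can be sketched as: on a hit in the sparse index, pull the whole neighbouring run of fingerprint metadata into a cache, then check every incoming fingerprint against that cache. The sketch assumes that a whole metadata container is loaded on a hit and models fingerprints as integers; the function name and data shapes are hypothetical.

```python
def lookup_with_locality(incoming_fps, index, containers):
    """Return, per incoming fingerprint, (metadata_id, data_pa) if duplicate, else None.

    index:      sampled fingerprint -> (container_id, metadata_id)
    containers: container_id -> list of (fingerprint, metadata_id, data_pa)
    """
    cache = {}
    for fp in incoming_fps:
        hit = index.get(fp)
        if hit is not None:
            container_id, _meta_id = hit
            # Assumption: the neighbouring metadata in the same container is loaded together.
            for stored_fp, meta_id, data_pa in containers[container_id]:
                cache[stored_fp] = (meta_id, data_pa)
    return [cache.get(fp) for fp in incoming_fps]


# Container 3 holds FP1-FP16 (metadata at PA201-PA216, data at PA1-PA16).
containers = {3: [(i, 200 + i, i) for i in range(1, 17)]}
index = {fp: (3, 200 + fp) for fp in (1, 5, 9, 13)}

# Data stream 3: fingerprints FP1-FP11 followed by five new fingerprints FP72-FP76.
stream3 = list(range(1, 12)) + list(range(72, 77))
hits = lookup_with_locality(stream3, index, containers)
print(hits[:3], hits[-1])
```

Only four fingerprints are held in the index, yet all eleven duplicate blocks are recognised because their metadata sits next to a sampled entry.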
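The storage device searches the fingerprint metadata, determines that the data blocks at LA61-LA71 are duplicate data blocks, and does not store them again. In the storage device, the unique data blocks corresponding to the data blocks at LA61-LA71 are stored at physical addresses PA1-PA11 in order, so the storage device establishes a mapping between LA61 and an aggregated address; for the specific implementation, see the preceding embodiments. In one implementation, the storage device establishes a mapping between LA61 and both PA1+11 and PA201+11, written LA61 ---> PA1+11 and PA201+11; for the representation, refer to the implementation described earlier. In another implementation, the storage device establishes a mapping between LA61 and the address identifier of the aggregated fingerprint metadata, and a mapping between that address identifier and the physical address of the aggregated data block, namely a mapping from LA61 to PA201+11 and a mapping from PA201+11 to PA1+11, written LA61 ---> PA201+11, PA201+11 ---> PA1+11. In another implementation, the storage device establishes a mapping between LA61 and the physical address of the aggregated data block, and a mapping between that physical address and the address identifier of the aggregated fingerprint metadata, written LA61 ---> PA1+11, PA1+11 ---> PA201+11. For the representation, refer to the implementation described earlier; the details are not repeated here.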
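One way to derive an entry such as LA61 ---> PA1+11 and PA201+11 from per-block lookup results is to merge consecutive hits whose data addresses and metadata identifiers both advance by one. The grouping rule and the function name below are assumptions made for this sketch; the text itself only states the resulting mapping.

```python
def aggregate_duplicate_runs(first_la, hits):
    """Merge consecutive duplicate hits into aggregated mappings.

    hits[i] is (metadata_id, data_pa) for the block at first_la + i, or None if unique.
    Returns {run_start_la: (data_pa, data_len, metadata_id, metadata_count)}.
    """
    mappings = {}
    run_la = None
    for offset, hit in enumerate(hits):
        la = first_la + offset
        if hit is None:
            run_la = None  # a unique block ends the current run
            continue
        meta_id, data_pa = hit
        if run_la is not None:
            pa, length, mid, count = mappings[run_la]
            if data_pa == pa + length and meta_id == mid + count:
                mappings[run_la] = (pa, length + 1, mid, count + 1)
                continue
        run_la = la  # start a new run at this logical address
        mappings[la] = (data_pa, 1, meta_id, 1)
    return mappings


# LA61-LA71 duplicate the blocks at PA1-PA11; LA72-LA76 are new.
hits = [(200 + i, i) for i in range(1, 12)] + [None] * 5
print(aggregate_duplicate_runs(61, hits))
```

For the example stream this yields a single entry keyed by LA61 with data range PA1+11 and metadata range PA201+11.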
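The data blocks at LA72-LA76 are non-duplicate data blocks. Following the implementation described earlier, to preserve data block duplication locality, they are stored in the order LA72-LA76 at contiguous physical addresses of container 5 (whose first physical address in the embodiments is PA401), denoted PA401-PA405. The fingerprint metadata of the data blocks at LA72-LA76 is stored contiguously, in the order LA72-LA76, on the physical addresses of container 6 (whose first physical address in the embodiments is PA501): FP72 with PA401 is stored at PA501, FP73 with PA402 at PA502, FP74 with PA403 at PA503, FP75 with PA404 at PA504, and FP76 with PA405 at PA505, denoted PA501-PA505. The storage device establishes a mapping from the address identifier of each fingerprint's metadata to that metadata: PA501 to FP72 with PA401, PA502 to FP73 with PA402, PA503 to FP74 with PA403, PA504 to FP75 with PA404, and PA505 to FP76 with PA405. Because the non-duplicate data blocks at LA72-LA76 are stored contiguously at PA401-PA405, and their fingerprint metadata is also stored contiguously on the physical addresses of container 6, the storage device establishes a mapping between LA72 and an aggregated address using the method described earlier. In one implementation, the storage device establishes a mapping between LA72 and both PA401+5 and PA501+5, written LA72 ---> PA401+5 and PA501+5; for the representation, refer to the implementation described earlier. In another implementation, the storage device establishes a mapping between LA72 and the address identifier of the aggregated fingerprint metadata, and a mapping between that address identifier and the physical address of the aggregated data block, namely a mapping from LA72 to PA501+5 and a mapping from PA501+5 to PA401+5, written LA72 ---> PA501+5, PA501+5 ---> PA401+5. In another implementation, the storage device establishes a mapping between LA72 and the physical address of the aggregated data block, and a mapping between that physical address and the address identifier of the aggregated fingerprint metadata, written LA72 ---> PA401+5, PA401+5 ---> PA501+5. For the representation, refer to the implementation described earlier; the details are not repeated here.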
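The storage device samples the fingerprints of the non-duplicate data blocks at LA72-LA76 to extend the fingerprint index. The embodiments take fixed-interval sampling from the fingerprint metadata stored in container 6 as the example, which yields a new fingerprint index built on FIG. 3, as shown in FIG. 5.
In the embodiments, take the mapping LA1 ---> PA1+16 and PA201+16 as an example. When the storage device receives a data read request that carries logical address LA2, it looks up the mapping LA1 ---> PA1+16 and PA201+16, determines that LA2 differs from LA1 by one logical address, and reads the data from the physical address that is offset by one logical address from the physical address corresponding to LA1.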
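The read path above amounts to finding the aggregated entry whose range covers the requested logical address and adding the offset to its first physical address. The sketch below assumes fixed-size blocks (one logical address per physical block) and uses a sorted-key search; the function name and tuple layout are hypothetical.

```python
import bisect


def resolve_read(mappings, la):
    """Return the physical address for `la`, or None if no aggregated entry covers it.

    mappings: {first_la: (data_pa, data_len, metadata_id, metadata_count)}
    """
    keys = sorted(mappings)
    i = bisect.bisect_right(keys, la) - 1  # last entry starting at or before `la`
    if i < 0:
        return None
    first_la = keys[i]
    data_pa, data_len, _meta_id, _meta_count = mappings[first_la]
    offset = la - first_la
    return data_pa + offset if offset < data_len else None


mappings = {1: (1, 16, 201, 16), 30: (101, 16, 301, 16)}
print(resolve_read(mappings, 2))   # one block past LA1, so PA2
print(resolve_read(mappings, 31))  # one block past LA30, so PA102
```

A production implementation would keep the keys in an ordered structure rather than sorting on every read; the sort is kept here only to keep the example short.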
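In practice, the storage device compresses the stored unique data blocks with a compression algorithm and sets a compression window, which is the length of data blocks that can be compressed at one time. In the embodiments, therefore, the physical address length of an aggregated data block does not exceed the compression window. Optionally, before establishing a mapping from a logical address to an aggregated address, the storage device queries its compression window and determines that the physical address length of the aggregated data block does not exceed it. For example, the mapping LA1 ---> PA1+16 and PA201+16 may be established for the data blocks at LA1-LA16, but if that physical address length would exceed the compression window, several mappings may be established instead, for example LA1 ---> PA1+8 and PA201+8 together with LA9 ---> PA9+8 and PA209+8. The storage device compresses the stored non-duplicate data according to the compression window.
In the embodiments, the non-duplicate data stored in container 5 and the fingerprint metadata stored in container 6 may also be stored in the same container, for example container 5; the embodiments do not limit this.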
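Splitting a run so that no aggregated data block exceeds the compression window can be sketched as below. The window is expressed in physical blocks, and the function name and tuple layout are assumptions for illustration.

```python
def split_by_window(first_la, first_pa, first_meta_id, n, window):
    """Cover n contiguous blocks with aggregated mappings of at most `window` blocks each."""
    if window < 1:
        raise ValueError("window must be at least one block")
    mappings = {}
    done = 0
    while done < n:
        run = min(window, n - done)
        mappings[first_la + done] = (first_pa + done, run, first_meta_id + done, run)
        done += run
    return mappings


print(split_by_window(1, 1, 201, 16, window=16))  # one entry: LA1 ---> PA1+16 and PA201+16
print(split_by_window(1, 1, 201, 16, window=8))   # two entries keyed by LA1 and LA9
```

With a window of 8 blocks the result matches the two mappings given in the text, LA1 ---> PA1+8 and PA201+8 together with LA9 ---> PA9+8 and PA209+8.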
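In the embodiments, for example, the mapping LA1 ---> PA1+16 and PA201+16 may be established for the data blocks at LA1-LA16. This mapping contains a mapping address direction identifier, which indicates that addressing proceeds from LA1 in increasing logical-address order. Optionally, the direction may be expressed through the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata; for example, the physical address of the aggregated data block increases from PA1 and the address identifier of the aggregated fingerprint metadata increases from PA201. Correspondingly, in another implementation, the mapping LA16 ---> PA16-16 and PA216-16 may be established, which likewise reduces the number of mappings and saves memory in the storage device. This mapping contains a mapping address direction identifier indicating that addressing proceeds from LA16 in decreasing logical-address order. Optionally, the direction may be expressed through the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata; for example, the physical address of the aggregated data block decreases from PA16 and the address identifier of the aggregated fingerprint metadata decreases from PA216. The details are not repeated here.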
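A direction identifier can be modelled as a sign stored alongside the range, so that the same resolution routine handles a mapping anchored at the first address (ascending) or at the last address (descending). Encoding the direction as +1 or -1 and the function name are assumptions of this sketch.

```python
def resolve_directional(anchor_la, entry, la):
    """Resolve `la` against one aggregated entry that carries a direction flag.

    entry: (anchor_pa, length, direction), where direction is +1 (ascending from the
    first address) or -1 (descending from the last address).
    """
    anchor_pa, length, direction = entry
    distance = (la - anchor_la) * direction  # blocks travelled in the addressing direction
    if 0 <= distance < length:
        return anchor_pa + direction * distance
    return None


ascending = (1, 16, +1)    # LA1 ---> PA1+16
descending = (16, 16, -1)  # LA16 ---> PA16-16

print(resolve_directional(1, ascending, 2))     # PA2
print(resolve_directional(16, descending, 15))  # PA15
```

Both entries describe the same 16 blocks; only the anchor and the sign differ.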
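The embodiments use a fixed-length chunking algorithm as the example for dividing a data stream into data blocks. In another scenario, a variable-length chunking algorithm, such as content-defined chunking (CDC), may be used instead. The storage device in the embodiments may implement deduplication in a file system, for example in network attached storage (NAS), in which case the logical address is a file identifier plus an offset address. The storage device may also implement deduplication of data blocks, for example in a storage area network (SAN), in which case the logical address is a logical block address (LBA).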
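Fixed-length chunking with per-block fingerprints can be sketched in a few lines. The use of SHA-256 as the fingerprint, the block size, and the numbering of logical addresses as consecutive integers are assumptions for this example; content-defined chunking is not shown.

```python
import hashlib


def fixed_length_chunks(data: bytes, block_size: int, first_la: int = 0):
    """Yield (logical_address, block, fingerprint) for each fixed-length block of `data`."""
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # Assumption: SHA-256 stands in for the fingerprint function.
        yield first_la + start // block_size, block, hashlib.sha256(block).hexdigest()


for la, block, fp in fixed_length_chunks(b"aaaabbbbaaaa", block_size=4, first_la=10):
    print(la, block, fp[:8])
```

In this toy stream the first and third blocks have identical content and therefore identical fingerprints, which is exactly the condition the deduplication lookup tests for.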
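The address identifier of fingerprint metadata in the embodiments may also be a logical identifier that uniquely identifies the metadata. The storage device may assign a globally unique identifier to the fingerprint metadata of each unique data block, with the identifiers of unique data blocks that have contiguous logical addresses increasing linearly. For example, the address identifiers of the fingerprint metadata of the data blocks at LA1-LA16 may be chunk identifiers (Chunk ID) Chunk ID1 to Chunk ID16 in order. In a specific implementation, the storage device deduplicates a data stream and determines the unique data blocks in it that have contiguous logical addresses. It stores the fingerprint metadata of these unique blocks at physical addresses of a container in the order of the blocks' logical addresses, and generates globally unique Chunk IDs for the fingerprint metadata in the same order, so that the Chunk IDs increase linearly in the order of the blocks' logical addresses.
In the embodiments, the unique data blocks of a data stream and the fingerprint metadata of those blocks are stored in different containers. In another implementation, they may be stored in different storage areas of the same container.
The embodiments use containers to store unique data blocks and fingerprint metadata. In another implementation, a tree structure may be used instead; specifically, the leaf nodes of the tree may store the unique data blocks and the fingerprint metadata.
Optionally, the embodiments may establish the logical-address-to-aggregated-address mapping only for duplicate data blocks with contiguous logical addresses, and establish one-to-one mappings for non-duplicate data blocks with contiguous logical addresses according to the existing implementation.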
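Linearly increasing, globally unique Chunk IDs for a run of unique blocks can be produced by a simple counter that hands out a contiguous range per run. The class name `ChunkIdAllocator` and the starting value of 1 are assumptions; the text only requires uniqueness and linear growth in logical-address order.

```python
class ChunkIdAllocator:
    """Hands out globally unique, consecutive Chunk IDs, one contiguous range per run."""

    def __init__(self, start: int = 1):
        self._next = start

    def allocate_run(self, n: int) -> range:
        # The i-th unique block of the run (in logical-address order) gets first + i.
        first = self._next
        self._next += n
        return range(first, first + n)


allocator = ChunkIdAllocator()
stream1_ids = allocator.allocate_run(16)  # Chunk ID1 to Chunk ID16 for LA1-LA16
stream2_ids = allocator.allocate_run(16)  # the next run continues after the first
print(stream1_ids[0], stream1_ids[-1], stream2_ids[0])
```

Because each run is a contiguous range, an aggregated fingerprint-metadata identifier can again be written as a first Chunk ID plus a count, just as with physical addresses.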
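As shown in FIG. 6, an embodiment of the present invention provides a storage device 600, which includes a receiving unit 601, a dividing unit 602, a calculating unit 603, a storage unit 604 and an establishing unit 605. The receiving unit 601 is configured to receive a first data stream. The dividing unit 602 is configured to divide the first data stream into n data blocks; the logical addresses of the n data blocks are contiguous; the n data blocks include a first data block whose logical address is the first address among the logical addresses of the n data blocks; n is an integer not less than 2. The calculating unit 603 is configured to compute the fingerprints of the n data blocks. The storage unit 604 is configured to, when no fingerprint identical to any of the fingerprints of the n data blocks is found in the storage device 600, store the n data blocks contiguously in a first storage area in the order of their logical addresses and store the fingerprint metadata of the n data blocks contiguously in a second storage area in the order of the logical addresses of the n data blocks; the physical address at which the first data block is stored in the first storage area is a first physical address; the metadata of any one of the fingerprints of the n data blocks includes that fingerprint and the physical address in the second storage area at which that fingerprint is stored. The establishing unit 605 is configured to establish, for each of the fingerprints of the n data blocks, a mapping between the address identifier of the fingerprint's metadata and the metadata, and to establish a mapping between the logical address of the first data block and an aggregated address, where the aggregated address includes the physical address of an aggregated data block and the address identifier of aggregated fingerprint metadata; the physical address of the aggregated data block includes the first physical address and the physical address length occupied by the n data blocks in the first storage area; the address identifier of the aggregated fingerprint metadata includes the address identifier of the metadata of the first data block's fingerprint and the number of address identifiers of the fingerprint metadata of the n data blocks.
In the embodiments, the storage device reduces the number of mappings, which saves memory in the storage device, while the mapping relationship can still be used to determine whether fingerprint metadata needs to be deleted.
Optionally, the first storage area and the second storage area in the storage device 600 are containers. Further, the first storage area and the second storage area may be the same storage area.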
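Optionally, the establishing unit 605 is specifically configured to establish a mapping between the logical address of the first data block and both the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
Optionally, the establishing unit 605 is specifically configured to establish a mapping between the logical address of the first data block and the address identifier of the aggregated fingerprint metadata, and a mapping between the address identifier of the aggregated fingerprint metadata and the physical address of the aggregated data block.
Optionally, the establishing unit 605 is specifically configured to establish a mapping between the logical address of the first data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
Optionally, the storage device 600 further includes a determining unit, configured to determine, before the mapping between the logical address of the first data block and the aggregated address is established, that the physical address length occupied by the n data blocks stored in the first storage area does not exceed the compression window of the storage device. Optionally, the storage device 600 further includes a compression unit, configured to compress the n data blocks stored in the first storage area according to the compression window.
Optionally, the receiving unit 601 is further configured to receive a second data stream. The dividing unit 602 is further configured to divide the second data stream into n data blocks; the logical addresses of the n data blocks of the second data stream are contiguous; the n data blocks of the second data stream include a second data block whose logical address is the first address among the logical addresses of the n data blocks of the second data stream. The calculating unit 603 is further configured to compute the fingerprints of the n data blocks of the second data stream. The establishing unit 605 is further configured to, when the storage device 600, by querying the fingerprint metadata of the n data blocks of the first data stream, determines that the n data blocks of the second data stream have the same fingerprints as the n data blocks of the first data stream at the same data block sequence positions, establish a mapping between the logical address of the second data block and the aggregated address; the data block sequence position is the relative position of each data block among the n data blocks in either the first data stream or the second data stream.
Optionally, the establishing unit 605 is further configured to establish an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint.
For the specific functions and implementation of the storage device 600 provided in the embodiments, refer to the methods and steps described in the preceding embodiments; the details are not repeated here.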
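In one implementation of the storage device 600 shown in FIG. 6, the above units are installed on the storage device 600 and can be loaded into its memory, and the CPU in the storage device 600 executes the instructions in the memory to implement the functions of the corresponding embodiments. In another implementation, the units included in the storage device 600 may be implemented by hardware, or by a combination of hardware and the CPU executing instructions in the memory. The above units are also called structural units.
The embodiments of the present invention further provide a non-volatile computer-readable storage medium and a computer program product. When the computer instructions contained in them are loaded into the memory of the CPU of the storage device 600 shown in FIG. 6, the CPU executes the loaded instructions to implement the corresponding functions of the embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and methods may be implemented in other ways. For example, the division into units described in the device embodiments above is only a division by logical function; other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit.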

Claims (24)

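  1. A deduplication method, characterized by comprising:
    a storage device receives a first data stream;
    the storage device divides the first data stream into n data blocks; the logical addresses of the n data blocks are contiguous; the n data blocks include a first data block whose logical address is the first address among the logical addresses of the n data blocks; n is an integer not less than 2;
    the storage device computes the fingerprints of the n data blocks;
    when the storage device finds no fingerprint identical to any of the fingerprints of the n data blocks, the storage device stores the n data blocks contiguously in a first storage area in the order of their logical addresses; the physical address at which the first data block is stored in the first storage area is a first physical address;
    the storage device stores the fingerprint metadata of the n data blocks contiguously in a second storage area in the order of the logical addresses of the n data blocks; the metadata of any one of the fingerprints of the n data blocks includes that fingerprint and the physical address in the second storage area at which that fingerprint is stored;
    the storage device establishes, for each of the fingerprints of the n data blocks, a mapping between the address identifier of the fingerprint's metadata and the metadata;
    the storage device establishes a mapping between the logical address of the first data block and an aggregated address, where the aggregated address includes the physical address of an aggregated data block and the address identifier of aggregated fingerprint metadata; the physical address of the aggregated data block includes the first physical address and the physical address length occupied by the n data blocks in the first storage area; the address identifier of the aggregated fingerprint metadata includes the address identifier of the metadata of the first data block's fingerprint and the number of address identifiers of the fingerprint metadata of the n data blocks.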
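  2. The method according to claim 1, characterized in that the storage device establishing the mapping between the logical address of the first data block and the aggregated address specifically comprises:
    the storage device establishes a mapping between the logical address of the first data block and both the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
  3. The method according to claim 1, characterized in that the storage device establishing the mapping between the logical address of the first data block and the aggregated address specifically comprises:
    the storage device establishes a mapping between the logical address of the first data block and the address identifier of the aggregated fingerprint metadata, and a mapping between the address identifier of the aggregated fingerprint metadata and the physical address of the aggregated data block.
  4. The method according to claim 1, characterized in that the storage device establishing the mapping between the logical address of the first data block and the aggregated address specifically comprises:
    the storage device establishes a mapping between the logical address of the first data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
  5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    before the storage device establishes the mapping between the logical address of the first data block and the aggregated address, the storage device determines that the physical address length occupied by the n data blocks stored in the first storage area does not exceed the compression window of the storage device.
  6. The method according to claim 5, characterized in that the method further comprises: the storage device compresses the n data blocks stored in the first storage area according to the compression window.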
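  7. The method according to any one of claims 1 to 4, characterized by further comprising:
    the storage device receives a second data stream;
    the storage device divides the second data stream into n data blocks; the logical addresses of the n data blocks of the second data stream are contiguous; the n data blocks of the second data stream include a second data block whose logical address is the first address among the logical addresses of the n data blocks of the second data stream;
    the storage device computes the fingerprints of the n data blocks of the second data stream;
    when the storage device, by querying the fingerprint metadata of the n data blocks of the first data stream, determines that the n data blocks of the second data stream have the same fingerprints as the n data blocks of the first data stream at the same data block sequence positions, the storage device establishes a mapping between the logical address of the second data block and the aggregated address; the data block sequence position is the relative position of each data block among the n data blocks in either the first data stream or the second data stream.
  8. The method according to claim 1, characterized in that the method further comprises:
    the storage device establishes an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint.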
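  9. A storage device, characterized by comprising:
    a receiving unit, configured to receive a first data stream;
    a dividing unit, configured to divide the first data stream into n data blocks; the logical addresses of the n data blocks are contiguous; the n data blocks include a first data block whose logical address is the first address among the logical addresses of the n data blocks; n is an integer not less than 2;
    a calculating unit, configured to compute the fingerprints of the n data blocks;
    a storage unit, configured to, when no fingerprint identical to any of the fingerprints of the n data blocks is found in the storage device, store the n data blocks contiguously in a first storage area in the order of their logical addresses and store the fingerprint metadata of the n data blocks contiguously in a second storage area in the order of the logical addresses of the n data blocks; the physical address at which the first data block is stored in the first storage area is a first physical address; the metadata of any one of the fingerprints of the n data blocks includes that fingerprint and the physical address in the second storage area at which that fingerprint is stored;
    an establishing unit, configured to establish, for each of the fingerprints of the n data blocks, a mapping between the address identifier of the fingerprint's metadata and the metadata, and to establish a mapping between the logical address of the first data block and an aggregated address, where the aggregated address includes the physical address of an aggregated data block and the address identifier of aggregated fingerprint metadata; the physical address of the aggregated data block includes the first physical address and the physical address length occupied by the n data blocks in the first storage area; the address identifier of the aggregated fingerprint metadata includes the address identifier of the metadata of the first data block's fingerprint and the number of address identifiers of the fingerprint metadata of the n data blocks.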
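  10. The storage device according to claim 9, characterized in that the establishing unit is specifically configured to establish a mapping between the logical address of the first data block and both the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
  11. The storage device according to claim 9, characterized in that the establishing unit is specifically configured to establish a mapping between the logical address of the first data block and the address identifier of the aggregated fingerprint metadata, and a mapping between the address identifier of the aggregated fingerprint metadata and the physical address of the aggregated data block.
  12. The storage device according to claim 9, characterized in that the establishing unit is specifically configured to establish a mapping between the logical address of the first data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
  13. The storage device according to any one of claims 9 to 12, characterized in that the storage device further comprises a determining unit, configured to determine, before the mapping between the logical address of the first data block and the aggregated address is established, that the physical address length occupied by the n data blocks stored in the first storage area does not exceed the compression window of the storage device.
  14. The storage device according to claim 13, characterized in that the storage device further comprises a compression unit, configured to compress the n data blocks stored in the first storage area according to the compression window.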
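  15. The storage device according to any one of claims 9 to 12, characterized in that:
    the receiving unit is further configured to receive a second data stream;
    the dividing unit is further configured to divide the second data stream into n data blocks; the logical addresses of the n data blocks of the second data stream are contiguous; the n data blocks of the second data stream include a second data block whose logical address is the first address among the logical addresses of the n data blocks of the second data stream;
    the calculating unit is further configured to compute the fingerprints of the n data blocks of the second data stream;
    the establishing unit is further configured to, when the storage device, by querying the fingerprint metadata of the n data blocks of the first data stream, determines that the n data blocks of the second data stream have the same fingerprints as the n data blocks of the first data stream at the same data block sequence positions, establish a mapping between the logical address of the second data block and the aggregated address; the data block sequence position is the relative position of each data block among the n data blocks in either the first data stream or the second data stream.
  16. The storage device according to claim 9, characterized in that the establishing unit is further configured to establish an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint.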
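  17. A storage device, characterized in that the storage device comprises an interface and a processor, and the interface communicates with the processor; wherein
    the interface is configured to receive a first data stream;
    the processor is configured to:
    divide the first data stream into n data blocks; the logical addresses of the n data blocks are contiguous; the n data blocks include a first data block whose logical address is the first address among the logical addresses of the n data blocks; n is an integer not less than 2;
    compute the fingerprints of the n data blocks;
    when no fingerprint identical to any of the fingerprints of the n data blocks is found in the storage device, store the n data blocks contiguously in a first storage area in the order of their logical addresses; the physical address at which the first data block is stored in the first storage area is a first physical address;
    store the fingerprint metadata of the n data blocks contiguously in a second storage area in the order of the logical addresses of the n data blocks; the metadata of any one of the fingerprints of the n data blocks includes that fingerprint and the physical address in the second storage area at which that fingerprint is stored;
    establish, for each of the fingerprints of the n data blocks, a mapping between the address identifier of the fingerprint's metadata and the metadata;
    establish a mapping between the logical address of the first data block and an aggregated address, where the aggregated address includes the physical address of an aggregated data block and the address identifier of aggregated fingerprint metadata; the physical address of the aggregated data block includes the first physical address and the physical address length occupied by the n data blocks in the first storage area; the address identifier of the aggregated fingerprint metadata includes the address identifier of the metadata of the first data block's fingerprint and the number of address identifiers of the fingerprint metadata of the n data blocks.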
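  18. The storage device according to claim 17, characterized in that the processor is specifically configured to establish a mapping between the logical address of the first data block and both the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
  19. The storage device according to claim 17, characterized in that the processor is specifically configured to establish a mapping between the logical address of the first data block and the address identifier of the aggregated fingerprint metadata, and a mapping between the address identifier of the aggregated fingerprint metadata and the physical address of the aggregated data block.
  20. The storage device according to claim 17, characterized in that the processor is specifically configured to establish a mapping between the logical address of the first data block and the physical address of the aggregated data block, and a mapping between the physical address of the aggregated data block and the address identifier of the aggregated fingerprint metadata.
  21. The storage device according to any one of claims 17 to 20, characterized in that the processor is further configured to determine, before establishing the mapping between the logical address of the first data block and the aggregated address, that the physical address length occupied by the n data blocks stored in the first storage area does not exceed the compression window of the storage device.
  22. The storage device according to claim 21, characterized in that the processor is further configured to compress the n data blocks stored in the first storage area according to the compression window.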
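  23. The storage device according to any one of claims 17 to 20, characterized in that:
    the interface is further configured to receive a second data stream;
    the processor is further configured to:
    divide the second data stream into n data blocks; the logical addresses of the n data blocks of the second data stream are contiguous; the n data blocks of the second data stream include a second data block whose logical address is the first address among the logical addresses of the n data blocks of the second data stream;
    compute the fingerprints of the n data blocks of the second data stream;
    when the storage device, by querying the fingerprint metadata of the n data blocks of the first data stream, determines that the n data blocks of the second data stream have the same fingerprints as the n data blocks of the first data stream at the same data block sequence positions, establish a mapping between the logical address of the second data block and the aggregated address; the data block sequence position is the relative position of each data block among the n data blocks in either the first data stream or the second data stream.
  24. The storage device according to claim 17, characterized in that the processor is further configured to establish an index of a first fingerprint among the fingerprints of the n data blocks of the first data stream, where the index of the first fingerprint includes a mapping between the first fingerprint and the address identifier of the metadata of the first fingerprint.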
PCT/CN2015/099572 2015-12-29 2015-12-29 Deduplication method and storage device WO2017113123A1 (zh)

Priority Applications (7)

Application Number Priority Date Filing Date Title
JP2018500840A JP6537214B2 (ja) 2015-12-29 2015-12-29 Deduplication method and storage device
PCT/CN2015/099572 WO2017113123A1 (zh) 2015-12-29 2015-12-29 Deduplication method and storage device
KR1020177026169A KR102082765B1 (ko) 2015-12-29 2015-12-29 Deduplication method and storage device
CN201580002563.7A CN107430602B (zh) 2015-12-29 2015-12-29 Deduplication method and storage device
EP15911754.8A EP3264285A4 (en) 2015-12-29 2015-12-29 Data deduplication method and storage device
SG11201707075SA SG11201707075SA (en) 2015-12-29 2015-12-29 Deduplication method and storage device
US15/959,273 US10613976B2 (en) 2015-12-29 2018-04-22 Method and storage device for reducing data duplication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/099572 WO2017113123A1 (zh) 2015-12-29 2015-12-29 Deduplication method and storage device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/959,273 Continuation US10613976B2 (en) 2015-12-29 2018-04-22 Method and storage device for reducing data duplication

Publications (1)

Publication Number Publication Date
WO2017113123A1 true WO2017113123A1 (zh) 2017-07-06

Family

ID=59224199

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/099572 WO2017113123A1 (zh) 2015-12-29 2015-12-29 Deduplication method and storage device

Country Status (7)

Country Link
US (1) US10613976B2 (zh)
EP (1) EP3264285A4 (zh)
JP (1) JP6537214B2 (zh)
KR (1) KR102082765B1 (zh)
CN (1) CN107430602B (zh)
SG (1) SG11201707075SA (zh)
WO (1) WO2017113123A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109309706A (zh) * 2017-07-27 2019-02-05 Emc知识产权控股有限公司 在云局域网上的存储系统之间共享预先计算的指纹和数据块的方法和系统
CN112328641A (zh) * 2021-01-05 2021-02-05 平安国际智慧城市科技股份有限公司 多维度数据聚合方法、装置及计算机设备
CN113608701A (zh) * 2021-08-18 2021-11-05 合肥大唐存储科技有限公司 一种存储系统中数据管理方法和固态硬盘
US20220253222A1 (en) * 2019-11-01 2022-08-11 Huawei Technologies Co., Ltd. Data reduction method, apparatus, computing device, and storage medium
US11461269B2 (en) * 2017-07-21 2022-10-04 EMC IP Holding Company Metadata separated container format

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108415669A (zh) * 2018-03-15 2018-08-17 深信服科技股份有限公司 存储系统的数据去重方法及装置、计算机装置及存储介质
US11055005B2 (en) 2018-10-12 2021-07-06 Netapp, Inc. Background deduplication using trusted fingerprints
US10852975B2 (en) * 2019-01-24 2020-12-01 EMC IP Holding Company LLC Efficient data deduplication caching
US11392551B2 (en) * 2019-02-04 2022-07-19 EMC IP Holding Company LLC Storage system utilizing content-based and address-based mappings for deduplicatable and non-deduplicatable types of data
CN114077569B (zh) * 2020-08-18 2023-07-18 富泰华工业(深圳)有限公司 压缩数据的方法及设备、解压缩数据的方法及设备
CN111949624B (zh) * 2020-09-11 2022-09-20 苏州浪潮智能科技有限公司 一种数据重删操作的pl超限控制方法、装置及可读存储介质
US11385817B2 (en) * 2020-09-22 2022-07-12 Vmware, Inc. Supporting deduplication in object storage using subset hashes
CN112463077B (zh) * 2020-12-16 2021-11-12 北京云宽志业网络技术有限公司 数据块处理方法、装置、设备及存储介质
JP7215804B2 (ja) * 2021-05-14 2023-01-31 Necプラットフォームズ株式会社 ストレージ装置、情報処理システム、情報処理方法、およびプログラム
US11874821B2 (en) 2021-12-22 2024-01-16 Ebay Inc. Block aggregation for shared streams

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100281207A1 (en) * 2009-04-30 2010-11-04 Miller Steven C Flash-based data archive storage system
US20110055471A1 (en) * 2009-08-28 2011-03-03 Jonathan Thatcher Apparatus, system, and method for improved data deduplication
CN103514250A (zh) * 2013-06-20 2014-01-15 易乐天 一种全局重复数据删除的方法和系统及存储装置
CN104239518A (zh) * 2014-09-17 2014-12-24 华为技术有限公司 重复数据删除方法和装置

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7814129B2 (en) * 2005-03-11 2010-10-12 Ross Neil Williams Method and apparatus for storing data with reduced redundancy using data clusters
JP4768009B2 (ja) * 2005-03-11 2011-09-07 ロックソフト リミテッド Method for storing data with reduced redundancy using data clusters
WO2011133443A1 (en) * 2010-04-19 2011-10-27 Greenbytes, Inc. A method for optimizing the memory usage and performance of data deduplication storage systems
US10394757B2 (en) * 2010-11-18 2019-08-27 Microsoft Technology Licensing, Llc Scalable chunk store for data deduplication
US9823981B2 (en) * 2011-03-11 2017-11-21 Microsoft Technology Licensing, Llc Backup and restore strategies for data deduplication
US9678863B2 (en) * 2012-06-12 2017-06-13 Sandisk Technologies, Llc Hybrid checkpointed memory
US10318495B2 (en) * 2012-09-24 2019-06-11 Sandisk Technologies Llc Snapshots for a non-volatile device
US8954392B2 (en) * 2012-12-28 2015-02-10 Futurewei Technologies, Inc. Efficient de-duping using deep packet inspection
US9141554B1 (en) * 2013-01-18 2015-09-22 Cisco Technology, Inc. Methods and apparatus for data processing using data compression, linked lists and de-duplication techniques
KR20140114515A (ko) * 2013-03-15 2014-09-29 삼성전자주식회사 Nonvolatile memory device and deduplication method thereof
KR101532283B1 (ko) * 2013-11-04 2015-06-30 인하대학교 산학협력단 Combined deduplication method for data and parity disks in SSD-based RAID storage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3264285A4 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11461269B2 (en) * 2017-07-21 2022-10-04 EMC IP Holding Company Metadata separated container format
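CN109309706A (zh) * 2017-07-27 2019-02-05 Emc知识产权控股有限公司 Method and system for sharing pre-computed fingerprints and data blocks between storage systems on a cloud local area network
CN109309706B (zh) * 2017-07-27 2022-03-04 Emc知识产权控股有限公司 Method and system for sharing fingerprints and data blocks between storage systems of a cloud local area network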
US20220253222A1 (en) * 2019-11-01 2022-08-11 Huawei Technologies Co., Ltd. Data reduction method, apparatus, computing device, and storage medium
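CN112328641A (zh) * 2021-01-05 2021-02-05 平安国际智慧城市科技股份有限公司 Multi-dimensional data aggregation method and apparatus, and computer device
CN112328641B (zh) * 2021-01-05 2021-04-20 平安国际智慧城市科技股份有限公司 Multi-dimensional data aggregation method and apparatus, and computer device
CN113608701A (zh) * 2021-08-18 2021-11-05 合肥大唐存储科技有限公司 Data management method in a storage system, and solid-state drive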

Also Published As

Publication number Publication date
KR102082765B1 (ko) 2020-02-28
JP2018514045A (ja) 2018-05-31
CN107430602A (zh) 2017-12-01
US20180267896A1 (en) 2018-09-20
EP3264285A1 (en) 2018-01-03
CN107430602B (zh) 2020-05-08
EP3264285A4 (en) 2018-05-30
JP6537214B2 (ja) 2019-07-03
SG11201707075SA (en) 2017-09-28
KR20170117572A (ko) 2017-10-23
US10613976B2 (en) 2020-04-07

Similar Documents

Publication Publication Date Title
WO2017113123A1 (zh) Data deduplication method and storage device
KR102261811B1 (ko) Apparatus and method for single-pass entropy detection in data transfer
US9715434B1 (en) System and method for estimating storage space needed to store data migrated from a source storage to a target storage
US9779023B1 (en) Storing inline-compressed data in segments of contiguous physical blocks
US8943032B1 (en) System and method for data migration using hybrid modes
US8949208B1 (en) System and method for bulk data movement between storage tiers
US20200150890A1 (en) Data Deduplication Method and Apparatus
JP6110517B2 (ja) Data object processing method and apparatus
US8615500B1 (en) Partial block allocation for file system block compression using virtual block metadata
KR102052789B1 (ko) Apparatus and method for single-pass entropy detection in data transfer
US10936228B2 (en) Providing data deduplication in a data storage system with parallelized computation of crypto-digests for blocks of host I/O data
WO2015199577A1 (en) Metadata structures for low latency and high throughput inline data compression
US9727479B1 (en) Compressing portions of a buffer cache using an LRU queue
WO2010099715A1 (zh) Data operation method, system, client, and data server
US20160335024A1 (en) Assisting data deduplication through in-memory computation
CN108415671B (zh) Data deduplication method and system for green cloud computing
US20180107404A1 (en) Garbage collection system and process
US20220300180A1 (en) Data Deduplication Method and Apparatus, and Computer Program Product
Kim et al. Design and implementation of binary file similarity evaluation system
US10521400B1 (en) Data reduction reporting in storage systems
US20220269431A1 (en) Data processing method and storage device
US20230367477A1 (en) Storage system, data management program, and data management method
CN110968575A (zh) Deduplication method for a big data processing system
Tolič et al. Efficient Deduplication in Disk-and RAM-based Data Storage Systems

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase (Ref document number: 11201707075S; Country of ref document: SG)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 15911754; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 20177026169; Country of ref document: KR; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2018500840; Country of ref document: JP; Kind code of ref document: A)
REEP Request for entry into the European phase (Ref document number: 2015911754; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)