US20170038978A1 - Delta Compression Engine for Similarity Based Data Deduplication - Google Patents
- Publication number
- US20170038978A1 (application US15/214,243)
- Authority
- US
- United States
- Prior art keywords
- data block
- block
- new
- sketch
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0611—Improving I/O performance in relation to response time
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
- G06F3/0641—De-duplication techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0661—Format or protocol conversion arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Definitions
- the present disclosure relates to data compression techniques.
- the present disclosure relates to a hardware embodiment of a delta compression engine for similar chunks of data.
- Data deduplication techniques for improving storage utilization are becoming increasingly important due to explosive growth of data in the world of the Internet and enterprise backup environments.
- Data deduplication involves a data compression technique for eliminating redundant data and thus reducing the amount of storage space needed to save data.
- Data deduplication, like other lossless compression techniques, is used to reduce the amount of data transferred (e.g., data sent across a WAN for disaster recovery or remote backups) and the amount of data stored (e.g., data retained on storage media such as tape or disk).
- Lossless compression techniques usually incur trade-offs between compression ratio and speed.
- Classic lossless compression algorithms such as LZ77 or LZO apply byte-level based searching of a dictionary and thus require a large DRAM resource as dictionary storage, which incurs a slower deduplication process.
- Snappy, an open source data compression algorithm written in C++, aims at achieving high speed rather than a maximized compression ratio.
- Other conventional deduplication technologies only look at identical data blocks, thus missing opportunities for compression where similar, non-identical, data blocks exist widely in data storage.
- Data deduplication techniques have proven successful in backup systems where duplicate data blocks are prevalent; however, achieving the same success in primary storage, which is mainly used in a production environment, has proven challenging.
- One challenge involves achieving maximized compression ratio in primary storage where similar data blocks, as opposed to duplicate data blocks, are more prevalent.
- Another challenge involves improving performance, because the required response time for each data unit in primary storage deduplication systems is much shorter than in backup deduplication systems.
- An additional challenge involves the limitation of resources and the slowing down of application performance running on a server. While backup deduplication systems have their own resources, primary storage deduplication systems share resources such as the CPU and RAM utilized in the production environment, which could result in performance degradation of applications running on the server.
- the present disclosure describes a delta compression engine including a block sketch computation module, a reference block indexing module, and a similar block delta compression module.
- the present disclosure further describes methods for delta compression.
- FIG. 1 is a high-level block diagram illustrating an example system including a storage controller having a delta compression engine.
- FIG. 2 is a block diagram illustrating an example system configured to implement the techniques introduced herein.
- FIG. 3 illustrates a block diagram of an example hardware architecture and logical flow of data through the delta compression engine, according to the techniques described herein.
- FIG. 4 illustrates a two parallel pipeline structure design of the delta compression engine, according to the techniques described herein.
- FIG. 5 is a flow chart of an example method for delta compression encoding a new reference data block, according to the techniques described herein.
- FIG. 6 illustrates an example of delta compression encoding, according to the techniques described herein.
- FIG. 7 illustrates a block diagram of a hardware decompression logic architecture, according to the techniques described herein.
- FIG. 8 is a graphic representation of shingles in a data stream, according to the techniques described herein.
- FIG. 9 is a graphic representation of an incremental computation pipeline design, according to the techniques described herein.
- FIG. 10 is a block diagram illustrating an example block signature module, according to the techniques described herein.
- FIG. 11 illustrates a parallel delta compression encoding structure, according to the techniques described herein.
- a hardware implemented delta compression system and method are needed to provide line speed data deduplication, to improve latency and compression ratio over software delta compression engines running on servers, to improve throughput, to provide for better data reduction ratio over conventional techniques, and to make similarity based deduplication more applicable to primary storage or storage caches.
- the hardware implementation introduced herein provides for improved processing speed for data deduplication of similar data chunks. Delta compression may be processed at line speed, with high throughput and fast response time, by means of pipelining and parallel data lookup across multiple hardware modules. Additionally, the hardware implementation introduced herein offloads deduplication functions from servers so that application performance is not negatively affected.
- the hardware architecture introduced herein may be implemented on a field-programmable gate array (FPGA). However, the hardware architecture should not be limited to implementation on an FPGA.
- the delta compression engine of the present disclosure may be implemented on other integrated circuits, such as an application-specific integrated circuit (ASIC).
- Data deduplication is a data compression technique for improving storage utilization by eliminating redundant copies of data.
- Data deduplication techniques are also applicable to data transfer by reducing the size of data, e.g., the number of bytes, sent over a network.
- Data deduplication involves the identification and storage of unique blocks or chunks of data, e.g. byte patterns.
- Data deduplication systems work by retaining a single unique block of data on storage media, such as tape or disk, and referencing the single unique block of data for all data objects that include a matching block of data.
- a delta compression process as introduced herein may involve splitting a file into multiple chunks and generating a fingerprint for each chunk. The fingerprint may be a strong hash digest of the chunk. The delta compression process may further involve determining whether two fingerprints match.
- a new incoming chunk's fingerprint is compared to an existing chunk's fingerprint previously stored in the delta compression system.
- a determination that the two fingerprints match is an indicator that the contents of the chunks are duplicate or identical. If the two fingerprints match, only metadata for the new incoming chunk, such as a file name or logical block address (LBA) and a reference to the existing content, will be stored. For example, a redundant new incoming chunk is not retained but is instead replaced by a small pointer to the stored existing chunk.
- a similar new incoming chunk is encoded and stored as a small pointer to a stored existing similar chunk and the difference in data between the new incoming chunk and the stored existing chunk.
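The exact-match path described above (split into chunks, fingerprint each chunk, keep only a pointer when the fingerprint is already known) can be sketched as follows. This is a minimal software illustration, not the disclosed hardware: SHA-256 as the strong hash, the fixed 4096-byte chunk size, and the names `ChunkStore`, `split_chunks`, and `fingerprint` are all assumptions made for the example.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Strong hash digest of a chunk (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()


def split_chunks(data: bytes, size: int = 4096):
    """Fixed-size chunking; the chunk size is an illustrative assumption."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical store that retains one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> stored chunk content
        self.recipes = {}  # file name -> ordered list of fingerprints (pointers)

    def write(self, name: str, data: bytes) -> int:
        """Store a file and return how many chunks were newly retained."""
        newly_stored = 0
        recipe = []
        for chunk in split_chunks(data):
            fp = fingerprint(chunk)
            if fp not in self.chunks:
                # Unseen content: retain the chunk itself.
                self.chunks[fp] = chunk
                newly_stored += 1
            # Seen or unseen, the file only records a pointer.
            recipe.append(fp)
        self.recipes[name] = recipe
        return newly_stored

    def read(self, name: str) -> bytes:
        """Rebuild a file by following its pointers."""
        return b"".join(self.chunks[fp] for fp in self.recipes[name])
```

Writing a second file that shares a chunk with the first retains only the chunk that was not already stored, which is the behavior the fingerprint comparison is meant to provide.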
- the terms block or chunk are used interchangeably in the present disclosure to refer to a basic unit of data deduplication.
- the terms block or chunk may refer to data of different sizes including, but not limited to, a file, data stream, or byte pattern.
- Data blocks and files in primary storage are often modified by functions such as cut, insert, delete, and update and reassembled in different contexts and packages.
- a slightly modified data block may generate a different hash sketch.
- a slightly modified data block will generate a hash sketch different than the original data block.
- the different hash sketch will not be indexed and stored by a standard deduplication process, which is generally determined by the indication of a duplicate or identical match.
- the sketch of the modified block may be the same as the sketch of the pre-modified data block.
- the weaker hash sketches may include e.g. several Rabin fingerprints and have the property that if two data blocks share the same sketch, then the two data blocks contain a lot of the same content, i.e. the two data blocks are likely near-duplicate.
- a new incoming block is compared to a list of reference data blocks to identify a related reference data block by comparing their sketches. If a related reference data block is identified among the list of reference data blocks, a delta compression of the new incoming block is performed against the identified related reference data block and only the delta is stored along with a pointer to the identified related reference data block.
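To make the idea of a weak, similarity-preserving sketch concrete, the following is a minimal sketch under stated assumptions. It uses a min-hash style selection over 8-byte shingles and keeps one byte per selected hash. The salted BLAKE2b hash, the selection rule, and the function name `sketch` are illustrative assumptions; the disclosure mentions Rabin fingerprints, which are swapped here for a standard-library hash.

```python
import hashlib

SHINGLE = 8      # bytes per shingle (assumed to mirror the 8-byte portions)
SKETCH_LEN = 8   # fingerprints per sketch, each one byte long


def sketch(block: bytes) -> tuple:
    """Return SKETCH_LEN one-byte fingerprints characterizing a block."""
    if len(block) < SHINGLE:
        raise ValueError("block is shorter than one shingle")
    # Every 8-byte window, shifted one byte at a time.
    shingles = {block[i:i + SHINGLE] for i in range(len(block) - SHINGLE + 1)}
    result = []
    for k in range(SKETCH_LEN):
        salt = bytes([k])  # a different salt gives a different hash function
        smallest = min(
            hashlib.blake2b(s, digest_size=4, salt=salt).digest()
            for s in shingles
        )
        # Keep the low byte of the minimum hash as the fingerprint.
        result.append(smallest[-1])
    return tuple(result)
```

Because each fingerprint depends on the minimum over all shingles, a small edit changes only the few shingles that overlap it, so most fingerprints of the modified block usually stay the same as those of the original.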
- delta compression can effectively deduplicate data at both the file and block levels.
- the central tenet of delta compression is to find the difference between two similar data blocks or chunks and try to retain only one of the two blocks in storage. The difference between the stored block and the remaining block along with a reference to the stored block is stored for the remaining block.
- Delta compression techniques offer deduplication benefit gains of 1.4 times compared to conventional deduplication techniques.
- improvements to the throughput of the system may be achieved through a hardware embodiment, making the similarity based deduplication techniques described in the present disclosure more applicable to primary storage or storage caches (e.g., providing approximately one gigabyte per second of throughput and sub-millisecond latency).
- FIG. 1 is a high-level block diagram illustrating an example system 100 including a storage controller having a delta compression engine.
- the system 100 includes one or more clients 102 a . . . 102 n , a network 104 , and a storage system including storage controller 106 and storage devices 108 a . . . n .
- the storage controller 106 includes delta compression engine 110 .
- the client devices 102 a . . . 102 n can be any computing device including one or more memory and one or more processors, for example, a laptop computer, a desktop computer, a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile email device, a portable game player, a portable music player, a television with one or more processors embedded therein or coupled thereto or any other electronic device capable of making storage requests.
- a client device 102 may execute an application that makes storage requests (e.g., read, write, etc.) to the storage devices 108 . While the example of FIG. 1 includes two clients, 102 a and 102 n , it should be understood that any number of clients 102 may be present in the system.
- Clients may be directly coupled with storage sub-systems including individual storage devices (e.g., storage device 108 a ) via storage controller 106 .
- clients may be indirectly coupled with storage sub-systems including individual storage devices 108 via a separate controller.
- the system 100 includes a storage controller 106 that provides a single interface for the client devices 102 to access the storage devices 108 in the storage system.
- the storage controller 106 may be a computing device configured to make some or all of the storage space on disks 108 available to clients 102 .
- client devices can be coupled to the storage controller 106 via network 104 (e.g., client 102 a ) or directly (e.g., client 102 n ).
- the network 104 can be one of a conventional type, wired or wireless, and may have numerous different configurations including a star configuration, token ring configuration, or other configurations. Furthermore, the network 104 may include a local area network (LAN), a wide area network (WAN) (e.g., the internet), and/or other interconnected data paths across which multiple devices (e.g., storage controller 106 , client device 102 , etc.) may communicate. In some embodiments, the network 104 may be a peer-to-peer network. The network 104 may also be coupled with or include portions of a telecommunications network for sending data using a variety of different communication protocols.
- the network 104 may include Bluetooth (or Bluetooth low energy) communication networks or a cellular communications network for sending and receiving data including via short messaging service (SMS), multimedia messaging service (MMS), hypertext transfer protocol (HTTP), direct data connection, WAP, email, etc.
- although FIG. 1 illustrates one network 104, one or more networks 104 can connect the entities of the system 100.
- FIG. 2 is a block diagram illustrating an example system 200 configured to implement the techniques introduced herein.
- the system 200 may be a client device 102 .
- the system 200 may be storage controller 106 .
- the system 200 may be implemented as a combination of a client device and storage controller 106 .
- the system 200 includes a network interface (I/F) module 202, a processor 204, a memory 206, a storage interface (I/F) module 208, a delta compression engine 110, and a storage device 216.
- Delta compression engine 110 includes block signature module 210 , a reference block index module 212 , and a delta encoding module 214 .
- the components of the system 200 are communicatively coupled to a bus or software communication mechanism 220 for communication with each other.
- software communication mechanism 220 may be an object bus (e.g., CORBA), direct socket communication (e.g., TCP/IP sockets) among software modules, remote procedure calls, UDP broadcasts and receipts, HTTP connections, function or procedure calls, etc. Further, any or all of the communication could be secure (SSH, HTTPS, etc.).
- the software communication mechanism 220 can be implemented on any underlying hardware, for example, a network, the Internet, a bus, a combination thereof, etc.
- the network interface (I/F) module 202 is configured to connect system 200 to a network and/or other system (e.g., network 104 ).
- network interface module 202 may enable communication through one or more of the internet, cable networks, and wired networks.
- the network interface module 202 links the processor 204 to the network 104 that may in turn be coupled to other processing systems (e.g., a server).
- the network interface module 202 also provides other conventional connections to the network 104 for distribution and/or retrieval of files and/or media objects using standard network protocols such as TCP/IP, HTTP, HTTPS and SMTP as will be understood.
- the network interface module 202 includes a transceiver for sending and receiving signals using WiFi, Bluetooth® or cellular communications for wireless communication.
- the processor 204 may include an arithmetic logic unit, a microprocessor, a general purpose controller or some other processor array to perform computations and provide electronic display signals to a display device.
- the processor 204 is a hardware processor having one or more processing cores.
- the processor 204 is coupled to the bus 220 for communication with the other components.
- Processor 204 processes data signals and may include various computing architectures including a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets.
- although FIG. 2 shows a single processor 204, multiple processors and/or processing cores may be included. It should be understood that other processor configurations are possible.
- the memory 206 stores instructions and/or data that may be executed by the processor 204 .
- the memory 206 is coupled to the bus 220 for communication with the other components of the system 200 .
- the instructions and/or data stored in the memory 206 may include code for performing any and/or all of the techniques described herein.
- the memory 206 may be, for example, non-transitory memory such as a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory or some other memory devices.
- the memory 206 also includes a non-volatile memory or similar permanent storage device and media, for example, a hard disk drive, a floppy disk drive, a compact disc read only memory (CD-ROM) device, a digital versatile disc read only memory (DVD-ROM) device, a digital versatile disc random access memories (DVD-RAM) device, a digital versatile disc rewritable (DVD-RW) device, a flash memory device, or some other non-volatile storage device.
- the storage interface (I/F) module 208 accesses information requested by the clients 102 .
- the information may be stored on any type of attached array of writable storage media, such as magnetic disk or tape, optical disk (e.g., CD-ROM or DVD), flash memory, solid-state drive (SSD), electronic random access memory (RAM), micro-electro mechanical and/or any other similar media adapted to store information, including data and parity information.
- the information is stored on disks 108 .
- the storage I/F module 208 includes a plurality of ports having input/output (I/O) interface circuitry that couples with the disks over an I/O interconnect arrangement.
- the delta compression engine 110 of system 200 may be configured to compress data for storage or transfer based on a delta compression similarity based data deduplication technique in accordance with the present disclosure.
- Delta compression engine 110 may include block signature module 210 , reference block index module 212 , and delta encoding module 214 .
- the block signature module 210 may be configured to compute signature sketches for data blocks based on a fingerprint computation. The signature sketches may be determined according to any generally known fingerprint computation. An exemplary fingerprint computation is described in accordance with the present disclosure.
- the block signature module 210 may be configured to determine the signature sketches of new incoming data blocks based on a fingerprint computation.
- the block signature module may be configured to determine the signature sketches of data blocks that will be stored in a reference list table or dictionary of reference data blocks.
- the reference block index module 212 is in communication with the block signature module 210 to receive signature sketches determined by the block signature module 210 .
- the reference block index module 212 may be configured to generate and search a reference index and reference dictionary using a determined block signature sketch, according to techniques disclosed herein, in order to identify related reference data blocks that may be used as a basis for delta compression.
- the reference block index module 212 may access, store, generate, and/or manage a reference index containing reference fingerprints or signature sketches (computed by the block signature module 210 ) against which new incoming fingerprints may be compared.
- the reference block index module 212 may be configured to compare a newly generated fingerprint to indexed fingerprints to identify a similar reference data block.
- the delta encoding module 214 compares an incoming data block corresponding with the newly generated fingerprint to a related reference data block stored among reference data blocks. For example, the delta encoding module 214 scans the incoming data block and the reference data block to determine a match between one or more data elements of the data blocks. The delta encoding module 214 encodes the new data block using matching data elements between the new data block and the reference data block to produce a compressed delta.
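The scan-and-encode behavior of the delta encoding module can be illustrated in software as below. This is a minimal sketch, assuming a hypothetical operation format of `("copy", reference_offset, length)` and `("literal", bytes)`; it is not the encoding output structure of the disclosure, and the 8-byte match window and the names `delta_encode` and `delta_decode` are assumptions.

```python
WINDOW = 8  # assumed minimum match length, in bytes


def delta_encode(ref: bytes, new: bytes):
    """Encode `new` as copies from `ref` plus literal bytes."""
    # Index every 8-byte window of the reference by its first position.
    index = {}
    for j in range(len(ref) - WINDOW + 1):
        index.setdefault(ref[j:j + WINDOW], j)

    ops = []
    literal = bytearray()
    i = 0
    while i < len(new):
        j = index.get(new[i:i + WINDOW])
        if j is None:
            # No matching data element: carry the byte verbatim.
            literal.append(new[i])
            i += 1
            continue
        # Extend the match as far as both blocks agree.
        length = WINDOW
        while (i + length < len(new) and j + length < len(ref)
               and new[i + length] == ref[j + length]):
            length += 1
        if literal:
            ops.append(("literal", bytes(literal)))
            literal = bytearray()
        ops.append(("copy", j, length))
        i += length
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def delta_decode(ref: bytes, ops) -> bytes:
    """Rebuild the new block from the reference block and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, j, length = op
            out += ref[j:j + length]
        else:
            out += op[1]
    return bytes(out)
```

For two blocks that differ in only a few bytes, the delta consists of a handful of copy operations and short literals, which is what makes storing the delta plus a pointer cheaper than storing the new block.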
- the block signature module 210 , the reference block index module 212 , and the delta encoding module 214 may be implemented in hardware, e.g. on a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like.
- the modules 210, 212, and 214 may be implemented on a V6-240T FPGA, or the like, and act as a co-processor in system 200. While depicted in FIG. 2 as distinct modules, it should be understood that one or more of the modules 210, 212, and/or 214 may be implemented on the same hardware device or various hardware devices.
- FIG. 3 illustrates a block diagram of an example hardware architecture and logical flow of data through the delta compression engine in accordance with the present disclosure.
- Reference sketches 310 are loaded into dictionary 318 .
- Dictionary 318 is a reference list table built up of reference data blocks associated with their fingerprint sketches (e.g., reference sketches 310 ). Dictionary 318 may be stored in random-access memory (RAM).
- Fast index 416 is a hash index table.
- a hash function 314 is performed on each reference sketch and a hash index table is built up of hash key records, where each record forms a pair composed of a hash key and an index to the reference list.
- the hash index table may be stored in RAM.
- a new sketch 312 is received and a hash function 314 is performed on the new sketch 312 .
- a hash key of the new sketch 312 is used to search fast index 416 for a similar hash key of a related reference sketch in one of the hash index records of fast index 416. If a matching hash key is found, the hash key record including an index to the reference list is used to locate a related reference sketch and its corresponding related reference data block in the dictionary 318.
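The two-level lookup described above (a hash index whose records point into the reference list held in the dictionary) can be sketched in software as follows. This is a minimal illustration under stated assumptions: CRC-32 truncated to 12 bits stands in for the unspecified hash function, collisions are handled by keeping several positions per key, and the class and method names are hypothetical.

```python
import zlib

TABLE_BITS = 12  # assumed index size: 4096 possible hash keys


def hash_key(sketch: bytes) -> int:
    """Hash a sketch down to a short key (CRC-32 is an assumed choice)."""
    return zlib.crc32(sketch) & ((1 << TABLE_BITS) - 1)


class ReferenceIndex:
    """Hypothetical model of the dictionary plus the fast hash index."""

    def __init__(self):
        self.reference_list = []  # dictionary entries: (sketch, data block)
        self.fast_index = {}      # hash key -> positions in reference_list

    def add_reference(self, sketch: bytes, block: bytes) -> int:
        """Load a reference block and record its hash key."""
        self.reference_list.append((sketch, block))
        pos = len(self.reference_list) - 1
        self.fast_index.setdefault(hash_key(sketch), []).append(pos)
        return pos

    def lookup(self, new_sketch: bytes):
        """Return (position, block) of a reference with this sketch, or None."""
        for pos in self.fast_index.get(hash_key(new_sketch), []):
            ref_sketch, block = self.reference_list[pos]
            # Confirm the sketch itself, since different sketches can
            # share a hash key.
            if ref_sketch == new_sketch:
                return pos, block
        return None
```

Keeping the index small and separate from the block data means a miss is detected after a single key probe, without touching the reference blocks themselves.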
- the new data block corresponding to new sketch 312 is compared at 322 to the related reference data block corresponding to the related reference sketch.
- a flag 323 is set based on a determination of a match between one or more data elements of the new data block and one or more data elements of the related reference data block.
- the new data block is delta compressed against the related reference data block and stored according to an encoding scheme using the match.
- a reference sketch 310 and a new sketch 312 are received by the delta compression engine.
- a sketch may be used to represent each data block and keep track of I/O access patterns to all sketches.
- the reference block index module 212 may be configured to generate dictionary 318 by storing reference data blocks and their sketches in a reference list. For example, based on content locality, access frequency, and/or recency of data contents, some of the most popular data blocks are selected and cached in dictionary 318 as reference data blocks in a reference list.
- a newly generated block sketch, e.g., new sketch 312, is used as a key to search the reference list of dictionary 318 to find a related reference data block in the reference list.
- the new data block corresponding to the new sketch 312 is compared to the related reference data block and then delta compressed against the related reference data block to produce a compressed delta.
- the compressed delta and a pointer to the related reference data block are stored in primary storage or cache.
- a sketch contains 8 fingerprints, each of which is one byte long. If n fingerprints match between the sketch of a new data block and the sketch of a reference data block (n from 4 to 8), the two data blocks are considered near-duplicate blocks, where n is referred to as a similarity threshold. Once a near-duplicate block is found in the reference index, i.e., fast index 416, using a hash 314 of the new sketch 312 as the key, the corresponding reference data block is read out of the dictionary and delta compression is performed against it.
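The similarity threshold test can be written out directly. The sketch below assumes that fingerprints are compared position by position and that the best-scoring reference wins when several pass the threshold; both choices, and the function names, are assumptions for illustration.

```python
def matching_fingerprints(sketch_a, sketch_b) -> int:
    """Count fingerprints that agree at the same position in both sketches."""
    return sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)


def is_near_duplicate(new_sketch, ref_sketch, n: int = 4) -> bool:
    """True when at least n of the 8 fingerprints match (n from 4 to 8)."""
    if not 4 <= n <= 8:
        raise ValueError("similarity threshold n must be between 4 and 8")
    return matching_fingerprints(new_sketch, ref_sketch) >= n


def find_related(new_sketch, ref_sketches, n: int = 4):
    """Return the position of the best reference passing the threshold."""
    best_pos, best_count = None, 0
    for pos, ref_sketch in enumerate(ref_sketches):
        if not is_near_duplicate(new_sketch, ref_sketch, n):
            continue
        count = matching_fingerprints(new_sketch, ref_sketch)
        if count > best_count:
            best_pos, best_count = pos, count
    return best_pos
```

A lower n admits more candidate reference blocks at the cost of larger deltas; a higher n is stricter and yields smaller deltas but finds fewer references.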
- FIG. 4 illustrates a two parallel pipeline structure design of the delta compression engine which may be employed according to the present disclosure.
- one pipeline e.g., reference pipeline 410
- the other pipeline e.g., compression pipeline 420
- the reference pipeline 410 processes reference data blocks (e.g., data blocks determined to be frequently or recently accessed) to load the reference data blocks into the dictionary 318 .
- portions of the reference data block e.g., 8 byte portions shifted 1 byte at a time
- another block RAM may be used to build a fast index 316 .
- the compression pipeline 420 processes an incoming new data block such that a quick search for repeated strings may be performed through the fast search structure. For example, an incoming new data block is hashed into a hash value that is used as a key to search at 422 for a related reference data block in dictionary 318 . In some embodiments, a bitwise comparison may be performed to confirm a bit-by-bit match of the two strings. Once a match is found at 424 , a sequential search at 423 is performed to maximize the match length. The search results are then encoded at 428 .
- a sequential search may be performed by an address prediction technique in order to optimize the length of the matched data string and maximize the compression ratio.
- in the address prediction technique, when a match is found, the delta encoding module 214 predicts that the next matching dictionary index location is the location directly after the current location, and does not search the dictionary by the hash key value for the next match.
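The effect of address prediction can be shown by counting how many hash searches a scan needs. In the minimal sketch below, the dictionary is modeled as the 8-byte windows of the reference block indexed by byte offset, and a hash search is issued only when the predicted location (the one directly after the last match) does not hold the expected data. The function name and the return format are assumptions.

```python
WINDOW = 8  # bytes per dictionary entry


def scan_with_prediction(ref: bytes, new: bytes):
    """Return (matches, hash_searches) for a scan of `new` against `ref`."""
    # Dictionary: 8-byte window -> first index location in the reference.
    index = {}
    for j in range(len(ref) - WINDOW + 1):
        index.setdefault(ref[j:j + WINDOW], j)

    matches = []        # (position in new block, dictionary index location)
    hash_searches = 0
    predicted = None
    for i in range(len(new) - WINDOW + 1):
        window = new[i:i + WINDOW]
        if predicted is not None and ref[predicted:predicted + WINDOW] == window:
            # Prediction hit: the next location continues the match.
            j = predicted
        else:
            # No prediction or a miss: fall back to the hash key search.
            hash_searches += 1
            j = index.get(window)
        if j is None:
            predicted = None
        else:
            matches.append((i, j))
            predicted = j + 1
    return matches, hash_searches
```

For a new block that is one contiguous run of the reference, only the first window requires a hash search; every later window is resolved by prediction, which is what lengthens matches without extra lookups.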
- the compression hardware of the present disclosure is further optimized to have wire speed compression by the design of a parallel delta compression encoding structure as seen in FIG. 11 .
- string matching is done for every 8 byte data chunk where subsequent data chunks in a data block are shifted by just one byte at a time.
- the bus width is 8 bytes, so the data transfer speed of the bus may be faster than one delta compression engine. Therefore, some embodiments include eight compression channels working in parallel to achieve wire speed. In one embodiment, each channel stores and encodes one data chunk.
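The relationship between the 8-byte bus width and the eight channels can be illustrated as follows: each 8-byte word that arrives starts eight new one-byte-shifted chunks, so eight channels can each take one of them. The round-robin assignment by start offset and the function names are assumptions made for this sketch, not the disclosed channel design.

```python
BUS_WIDTH = 8  # bytes delivered per bus transfer
CHANNELS = 8   # parallel compression channels


def shingles(block: bytes):
    """All 8-byte chunks of a block, each shifted by one byte."""
    return [block[i:i + BUS_WIDTH] for i in range(len(block) - BUS_WIDTH + 1)]


def assign_channels(block: bytes):
    """Distribute chunks over channels by start offset (assumed round-robin)."""
    channels = [[] for _ in range(CHANNELS)]
    for start, chunk in enumerate(shingles(block)):
        channels[start % CHANNELS].append((start, chunk))
    return channels
```

A block of L bytes produces L - 7 chunks, so a single channel handling one chunk at a time would fall behind the bus by roughly a factor of eight; spreading the chunks over eight channels closes that gap.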
- FIG. 5 is a flow chart of an example method for delta compression encoding a new reference data block.
- reference data blocks are loaded into dictionary 318 and sketches are generated for the loaded reference data blocks.
- a sketch of a reference data block is generated by the block signature module 210 creating a group of fingerprints characterizing the data of the reference data block.
- the reference data blocks are chosen based on how frequently and/or recently the data blocks have been accessed.
- the reference block index module 212 identifies a reference data block related to an incoming new data block using a sketch of the new data block as a key to the dictionary 318 .
- the reference block index module 212 further uses a fast hash index 416 as described above.
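One way to picture the sketch-keyed lookup is the following minimal sketch. It assumes a sketch is a tuple of integer fingerprints and that the related reference block is the one sharing the most fingerprints with the new block; the class `ReferenceIndex`, its methods, and the voting rule are illustrative assumptions, not the structure of dictionary 318 or fast hash index 416.

```python
from collections import Counter, defaultdict


class ReferenceIndex:
    """Hypothetical index from fingerprints to reference block identifiers."""

    def __init__(self):
        self._by_fp = defaultdict(set)

    def add(self, block_id, sketch):
        # Register every fingerprint of the reference block's sketch.
        for fp in sketch:
            self._by_fp[fp].add(block_id)

    def lookup(self, sketch):
        # Count, per reference block, how many fingerprints it shares
        # with the sketch of the new block.
        votes = Counter()
        for fp in sketch:
            for block_id in self._by_fp.get(fp, ()):
                votes[block_id] += 1
        if not votes:
            return None  # no related reference block found
        return votes.most_common(1)[0][0]


index = ReferenceIndex()
index.add(1, (10, 20, 30))
index.add(2, (30, 40, 50))
related = index.lookup((30, 40, 99))
unrelated = index.lookup((7, 8))
```

A block with no shared fingerprints returns `None`, which would correspond to storing the new block without delta encoding.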
- the new data block and the identified related reference data block are fed into delta encoding module 214 .
- the delta encoding module 214 scans the related reference data block and the new data block for repetitive or matching data strings or data elements.
- the matched data elements of the new data block are encoded 512 according to the encoding output structure for matched data elements as described herein.
- the delta encoding module 214 determines at 516 whether the end of the new data block has been reached. If the end has been reached, the process returns to 504 where a new data block will be encoded. If, at 516 , the end of the new data block has not been reached, the method continues at 508 to scan the new data block and the related reference data block for matching data elements or data strings in order to encode the remaining data elements of the new data block.
- FIG. 6 illustrates an example of delta compression encoding according to the techniques disclosed herein.
- Blk ref is used to refer to a related reference data block
- Blk new is used to refer to a new data block to be compressed using the related reference data block.
- the related reference data block is loaded into the dictionary prior to receiving the new data block for compression.
- the delta encoding module 214 compares the two data blocks to determine repetitions between the two data blocks.
- the encoded data includes a number of fields to identify matched or non-matched data elements and locations to where the data elements can be found on storage media.
- the fields may include an offset, a flag, an index, and a length.
- the offset field indicates the position of a data element in the new data block or the related reference data block. For example, when data elements in the new data block and the related reference data block match, the offset field indicates the ending position of the matched one or more data elements in the new data block. Similarly, when a data element in the new data block does not match, the offset field indicates the position of the data element in the new data block that did not match a data element in the related reference data block.
- the flag field indicates whether a data element in the new data block has a match in the related reference data block. For example, the flag field may be set to 1 if a match is found in the related reference data block for a data element of the new data block and may be set to 0 if no match is found.
- the index field indicates the starting position of the matched string in the related reference data block.
- the length field indicates the total length of the matched string.
- the miss field indicates the data elements from the new data block which do not appear in the related reference data block (e.g., when the flag field is set to 0). For example, the miss field may store a physical or logical address for the data elements stored to a storage device.
- the output for the above described match may be encoded as (1,1,7,2) with a reference to the related reference data block, as shown in the example of FIG. 6 .
- the output may be encoded as (3,0, Dw 4 ).
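The record layout described above can be illustrated with a small software encoder. This is a sketch under assumptions: matched runs are emitted as (offset, 1, index, length), with the offset taken as the ending position of the run in the new block, and misses as (offset, 0, value). The function `delta_encode`, the `min_len` threshold, the brute-force search, and the sample blocks are hypothetical and do not reproduce the contents of FIG. 6.

```python
def delta_encode(ref, new, min_len=2):
    """Encode `new` against `ref` as a list of match and miss records."""
    records = []
    pos = 0
    while pos < len(new):
        best_idx, best_len = -1, 0
        # Brute-force longest match; the disclosed hardware uses a hashed
        # dictionary instead (this loop is only for illustration).
        for idx in range(len(ref)):
            length = 0
            while (pos + length < len(new) and idx + length < len(ref)
                   and new[pos + length] == ref[idx + length]):
                length += 1
            if length > best_len:
                best_idx, best_len = idx, length
        if best_len >= min_len:
            end = pos + best_len - 1           # ending position in new block
            records.append((end, 1, best_idx, best_len))
            pos += best_len
        else:
            records.append((pos, 0, new[pos]))  # miss: carry the element itself
            pos += 1
    return records


blk_ref = ["Da", "Db", "Dc", "Dd", "De", "Df", "Dg", "Dh", "Di"]
blk_new = ["Dh", "Di", "Dx", "Dw4"]
encoded = delta_encode(blk_ref, blk_new)
```

With these sample blocks the first record has the same shape as the (1,1,7,2) match quoted above, and the last has the shape of the (3,0, Dw 4) miss.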
- Algorithm 1 shows the process for single dictionary encoding.
- both reference data block dictionary updating and new data block delta encoding can be processed at line speed by parallel computation in hardware design.
- Algorithm 2 below shows the process for multiple dictionary encoding where a single large dictionary may be split into 8 smaller dictionaries such that multiple dictionaries may perform parallel store and search.
- FIG. 7 illustrates a block diagram of a hardware decompression logic architecture.
- a multiplexor (MUX) 720 selects either the value from dictionary 718 or miss 704 and sends the selected value to decompression FIFO 730 for recovery of the delta compressed data.
- the dictionary 718 or miss 704 stores a reference to data stored elsewhere and provides the reference to the decompression FIFO 730.
- the value of flag 703 is determined by whether a string in a delta compressed data block has a match in a related reference data block. If there is a match, (e.g., flag 703 holds the value 1), index 701 and length 702 are used to produce the data stream or corresponding data elements from the dictionary 718 .
- miss data 704 refers to the value of the data element in a delta compressed data block that did not have a match to a data element in a related reference data block.
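The decompression path can be mirrored in software as follows. This is a minimal sketch assuming records of the form (offset, flag, index, length) for matches and (offset, flag, value) for misses; the function `delta_decode` is hypothetical, with the `if`/`else` playing the role of the multiplexor and the output list playing the role of the decompression FIFO.

```python
def delta_decode(ref, records):
    """Rebuild the new data block from the reference block and delta records."""
    fifo = []
    for rec in records:
        flag = rec[1]
        if flag == 1:
            # Match: index and length select elements from the dictionary.
            _, _, index, length = rec
            fifo.extend(ref[index:index + length])
        else:
            # Miss: the element is carried in the record itself.
            fifo.append(rec[2])
    return fifo


blk_ref = ["Da", "Db", "Dc", "Dd", "De", "Df", "Dg", "Dh", "Di"]
records = [(1, 1, 7, 2), (2, 0, "Dx"), (3, 0, "Dw4")]
decoded = delta_decode(blk_ref, records)
```

Records are consumed in order, so the offset field is not needed to rebuild the block in this simplified model.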
- data block sketches e.g. reference sketch 310 and new sketch 312
- data block sketches are derived by a Rabin fingerprint calculation for every fixed-size sliding window (e.g. 8 bytes long).
- the block signature module 210 processes multiple bits in one clock cycle to provide fingerprinting for high data rate applications.
- a single modulo operation e.g., determining a Rabin fingerprint
- the data string is 64 bits resulting in 16-bit Rabin fingerprints.
- a combinatorial circuit may be used to compute an exclusive-OR (XOR) of all of the corresponding input bits.
- XOR exclusive-OR
- the combination of these 16 circuits is referred to herein as a Fresh function.
- FIG. 8 depicts shingles in a data stream from ⁇ 0 to ⁇ 71 , where A(x) is the first shingle, and B(x) is the second shingle. While the example of FIG. 8 depicts a shift of one byte, shingles can shift in various other multiples of bits.
- the Fresh function may be replicated over each shingle. However, it is evident that overlapping computations occur in this scheme.
- the relation between the Rabin fingerprints of A and B can be calculated as:
- the fingerprint of the new shingle B(x) is dependent on the fingerprint of the old shingle A(x), the first byte of the old shingle U(x), and the first byte of incoming data W(x), which is the last byte of the new shingle B(x).
- the fingerprint calculation of each shingle can be optimized using the fingerprint calculation of the previous shingle.
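The dependence described above can be written out as a worked relation. This is a reconstruction under assumptions rather than the formula printed in the original: it assumes an 8-byte shingle shifted by one byte, arithmetic over GF(2) (so addition and subtraction are both XOR), a modulus polynomial P(x), and a hypothetical symbol M(x) for the seven bytes shared by both shingles.

```latex
% Assumed notation: arithmetic over GF(2); P(x) is the modulus polynomial;
% M(x) (hypothetical symbol) is the seven bytes shared by A(x) and B(x).
\begin{align*}
A(x) &= U(x)\,x^{56} + M(x), \\
B(x) &= M(x)\,x^{8} + W(x) \\
     &= x^{8}A(x) + x^{64}U(x) + W(x), \\
B(x) \bmod P(x) &= \bigl(x^{8}\,(A(x) \bmod P(x)) + x^{64}U(x) + W(x)\bigr) \bmod P(x).
\end{align*}
```

The last line shows that only the previous fingerprint, the outgoing byte U(x), and the incoming byte W(x) are needed to obtain the new fingerprint.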
- An incremental computation pipeline design is illustrated in FIG. 9 .
- the data is drawn from two consecutive clock cycles, for example ( ⁇ 0, ⁇ 1 , . . . , ⁇ 63) from the preceding cycle and ( ⁇ 64, ⁇ 65, . . . ⁇ 127) from the following cycle.
- the techniques disclosed herein include finding an irreducible polynomial for which Rabin fingerprint computation has the least amount of operations for one full computation and several incremental computations of a multiple byte data shingle to group the data in a stream (e.g., seven incremental computations for an eight byte data shingle).
- the techniques further include computing a Rabin fingerprint incrementally using the selected irreducible polynomial. For example, incremental computation may allow computation of a fingerprint to reuse calculation results from a previous fingerprint calculation of eight bytes. As an example, the fingerprint calculation may calculate the fingerprint of all eight bytes numbered zero to seven, and may shift one byte to the right for a next clock cycle.
- the calculations for bytes zero to seven may be reused, and the calculations involving byte eight and byte zero may be performed.
- the fingerprint for the shingle of bytes one to eight may be performed incrementally, reusing the calculations of the prior fingerprint for eight bytes and performing new calculations.
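The incremental computation can be checked in software with the sketch below. It is illustrative only: the modulus `0x1100B` (x^16 + x^12 + x^3 + x + 1) is an assumption chosen because it is a commonly used degree-16 polynomial, not the polynomial selected in the disclosure, and the functions `poly_mod`, `fresh`, `shift`, and `fingerprints` are hypothetical software stand-ins for the Fresh and Shift circuits.

```python
POLY = 0x1100B  # assumed modulus: x^16 + x^12 + x^3 + x + 1


def poly_mod(value: int, poly: int = POLY) -> int:
    """Reduce a GF(2) polynomial, held in the bits of `value`, modulo `poly`."""
    deg = poly.bit_length() - 1
    while value.bit_length() > deg:
        # Cancel the leading term by XOR-ing an aligned copy of the modulus.
        value ^= poly << (value.bit_length() - 1 - deg)
    return value


def fresh(shingle: bytes) -> int:
    """Full fingerprint computation over one shingle."""
    return poly_mod(int.from_bytes(shingle, "big"))


def shift(prev_fp: int, old_byte: int, new_byte: int, size: int = 8) -> int:
    """Incremental fingerprint: reuse prev_fp, drop old_byte, append new_byte."""
    return poly_mod((prev_fp << 8) ^ (old_byte << (8 * size)) ^ new_byte)


def fingerprints(data: bytes, size: int = 8):
    """One fresh computation, then one shift per additional byte.

    Assumes len(data) >= size.
    """
    fp = fresh(data[:size])
    out = [fp]
    for i in range(size, len(data)):
        fp = shift(fp, data[i - size], data[i], size)
        out.append(fp)
    return out


data = bytes(range(1, 21))
incremental = fingerprints(data)
direct = [fresh(data[i:i + 8]) for i in range(len(data) - 8 + 1)]
```

Comparing `incremental` with `direct` confirms that the shifted fingerprints equal the fingerprints computed from scratch for every shingle.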
- FIG. 10 is a block diagram illustrating an example block signature module 210 .
- the example block signature module 210 includes a fingerprint pipeline 1002 , a number of sampling modules 1004 a - 1004 n , and a fingerprint selection module 1006 .
- data 1008 flows from top to bottom through the fingerprint pipeline.
- the total number of fingerprints generated for a w-byte data chunk according to the techniques disclosed herein is w − b + 1, where b is the size of the shingles.
- several fingerprints may be chosen from among all of the fingerprints as a sketch to represent the data chunk.
- fingerprints with upper N bits having a specific pattern are selected for the sketch since these upper bits in each fingerprint can be considered as randomly distributed.
- the result of this selection is a good choice in terms of balancing processing speed, similarity detection, elimination of false positives, and resolution.
- Fingerprint results produced at every pipeline stage are sent to the right for the corresponding channel sampling modules to process.
- the fingerprints are sampled and stored in an intermediate buffer. After the sampling for a data chunk is done, the fingerprint selection module chooses from the intermediate samples and returns a sketch for the data block.
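The sampling and selection steps can be sketched as below, assuming 16-bit fingerprints. The text does not state which pattern is used or how the final fingerprints are picked from the intermediate samples, so the pattern value, the choice of the `k` smallest distinct samples, and the names `sample` and `select_sketch` are all assumptions for illustration.

```python
def sample(fps, n_bits=4, pattern=0b0000, width=16):
    """Keep fingerprints whose upper n_bits equal the given pattern."""
    return [fp for fp in fps if fp >> (width - n_bits) == pattern]


def select_sketch(fps, k=4, n_bits=4, pattern=0b0000, width=16):
    """Return up to k sampled fingerprints as the sketch of a data block.

    Assumption: the k smallest distinct samples are chosen; the disclosure
    does not specify the selection rule.
    """
    samples = sorted(set(sample(fps, n_bits, pattern, width)))
    return tuple(samples[:k])


fps = [0x0123, 0xF000, 0x0FFF, 0x1234, 0x0001, 0x0123]
sampled = sample(fps)
sketch = select_sketch(fps, k=2)
```

Because the upper bits behave as if randomly distributed, a 4-bit pattern keeps roughly one fingerprint in sixteen on average.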
- the pipeline is composed of one Fresh function and several following Shift functions.
- a process can generally be considered a self-consistent sequence of steps leading to a result.
- the steps may involve physical manipulations of physical quantities. These quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. These signals may be referred to as being in the form of bits, values, elements, symbols, characters, terms, numbers or the like.
- the disclosed technologies may also relate to an apparatus for performing the operations herein.
- This apparatus may be specially constructed for the required purposes, or it may include a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer.
- a computer program may be stored in a computer readable storage medium, for example, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, flash memories including USB keys with non-volatile memory or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
- the disclosed technologies can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements.
- the technology is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
- a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- a computing system or data processing system suitable for storing and/or executing program code will include at least one processor (e.g., a hardware processor) coupled directly or indirectly to memory elements through a system bus.
- the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.
- Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
- modules, routines, features, attributes, methodologies and other aspects of the present technology can be implemented as software, hardware, firmware or any combination of the three.
- a component an example of which is a module
- the component can be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future in computer programming.
- the present techniques and technologies are in no way limited to embodiment in any specific programming language, or for any specific operating system or environment. Accordingly, the disclosure of the present techniques and technologies is intended to be illustrative, but not limiting.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
- The present application claims priority, under 35 U.S.C. §119, of U.S. Provisional Patent Application No. 62/201,493, filed Aug. 5, 2015 and entitled “Delta Compression Engine For Similarity Based Data Deduplication,” which is incorporated by reference in its entirety.
- The present disclosure relates to data compression techniques. In particular, the present disclosure relates to a hardware embodiment of a delta compression engine for similar chunks of data.
- Data deduplication techniques for improving storage utilization are becoming increasingly important due to explosive growth of data in the world of the Internet and enterprise backup environments. Data deduplication involves a data compression technique for eliminating redundant data and thus reducing the amount of storage space needed to save data. Data deduplication, like other lossless compression techniques, is used to reduce the amount of data transferred (e.g., data sent across a WAN for disaster recovery or remote backups) and data stored (e.g., data retained on storage media such as tape or disk). Lossless compression techniques usually incur trade-offs between compression ratio and speed. Classic lossless compression algorithms such as LZ77 or LZO apply byte-level based searching of a dictionary and thus require a large DRAM resource as dictionary storage, which incurs a slower deduplication process. Snappy, an open source data compression algorithm written in C++, aims at achieving high speed rather than a maximized compression ratio. Other conventional deduplication technologies only look at identical data blocks, thus missing opportunities for compression where similar, non-identical, data blocks exist widely in data storage.
- Data deduplication techniques have proven successful in backup systems where duplicate data blocks are prevalent; however, achieving the same success in primary storage, which is mainly used in a production environment, has proven challenging. One challenge involves achieving maximized compression ratio in primary storage where similar data blocks, as opposed to duplicate data blocks, are more prevalent. Another challenge involves improving performance where the required response time for each data unit in primary storage deduplication systems is much shorter than in backup deduplication systems. An additional challenge involves the limitation of resources and the slowing down of application performance running on a server. While backup deduplication systems have their own resources, primary storage deduplication systems share resources such as the CPU and RAM utilized in the production environment, which could result in performance degradation of applications running on the server.
- Systems and methods of a delta compression engine for similarity based data deduplication are disclosed. The present disclosure describes a delta compression engine including a block sketch computation module, a reference block indexing module, and a similar block delta compression module. The present disclosure further describes methods for delta compression.
- Other embodiments of one or more of these aspects include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices. It should be understood that the language used in the present disclosure has been principally selected for readability and instructional purposes, and not to limit the scope of the subject matter disclosed herein.
- The present disclosure is illustrated by way of example, and not by way of limitation in the figures of the accompanying drawings in which like reference numerals are used to refer to similar elements.
- FIG. 1 is a high-level block diagram illustrating an example system including a storage controller having a delta compression engine.
- FIG. 2 is a block diagram illustrating an example system configured to implement the techniques introduced herein.
- FIG. 3 illustrates a block diagram of an example hardware architecture and logical flow of data through the delta compression engine, according to the techniques described herein.
- FIG. 4 illustrates a two parallel pipeline structure design of the delta compression engine, according to the techniques described herein.
- FIG. 5 is a flow chart of an example method for delta compression encoding a new reference data block, according to the techniques described herein.
- FIG. 6 illustrates an example of delta compression encoding, according to the techniques described herein.
- FIG. 7 illustrates a block diagram of a hardware decompression logic architecture, according to the techniques described herein.
- FIG. 8 is a graphic representation of shingles in a data stream, according to the techniques described herein.
- FIG. 9 is a graphic representation of an incremental computation pipeline design, according to the techniques described herein.
- FIG. 10 is a block diagram illustrating an example block signature module, according to the techniques described herein.
- FIG. 11 illustrates a parallel delta compression encoding structure, according to the techniques described herein.
- Systems and methods for implementing a pipelined hardware architecture of a delta compression engine for similarity based data deduplication are described below. While the systems and methods of the present disclosure are described in the context of a particular system architecture, it should be understood that the systems, methods and interfaces can be applied to other architectures and organizations of hardware.
- A hardware implemented delta compression system and method are needed to provide line speed data deduplication, to improve latency and compression ratio over software delta compression engines running on servers, to improve throughput, to provide for better data reduction ratio over conventional techniques, and to make similarity based deduplication more applicable to primary storage or storage caches. The hardware implementation introduced herein provides for improved processing speed for data deduplication of similar data chunks. Delta compression may be processed at line speed, providing high throughput and fast response time by means of pipelining and parallel data lookup across multiple hardware modules. Additionally, the hardware implementation introduced herein offers an offload of deduplication functions from servers so that application performance is not negatively affected. The hardware architecture introduced herein may be implemented on a field-programmable gate array (FPGA). However, the hardware architecture should not be limited to implementation on an FPGA. For example, the delta compression engine of the present disclosure may be implemented on other integrated circuits, such as an application-specific integrated circuit (ASIC).
- Data deduplication is a data compression technique for improving storage utilization by eliminating redundant copies of data. Data deduplication techniques are also applicable to data transfer by reducing the size of data, e.g., the number of bytes, sent over a network. Data deduplication involves the identification and storage of unique blocks or chunks of data, e.g. byte patterns. Data deduplication systems work by retaining a single unique block of data on storage media, such as tape or disk, and referencing the single unique block of data for all data objects that include a matching block of data. A delta compression process as introduced herein may involve splitting a file into multiple chunks and generating a fingerprint for each chunk. The fingerprint may be a strong hash digest of the chunk. The delta compression process may further involve determining whether two fingerprints match. A new incoming chunk's fingerprint is compared to an existing chunk's fingerprint previously stored in the delta compression system. A determination that the two fingerprints match is an indicator that the contents of the chunks are duplicate or identical. If the two fingerprints match, only metadata for the new incoming chunk, such as a file name or logical block address (LBA) and a reference to the existing content, will be stored. For example, a redundant new incoming chunk is not retained but is instead replaced by a small pointer to the stored existing chunk. In another embodiment, a similar new incoming chunk is encoded and stored as a small pointer to a stored existing similar chunk and the difference in data between the new incoming chunk and the stored existing chunk. The terms block and chunk are used interchangeably in the present disclosure to refer to a basic unit of data deduplication. The terms block or chunk may refer to data of different sizes including, but not limited to, a file, data stream, or byte pattern.
- Data blocks and files in primary storage are often modified by functions such as cut, insert, delete, and update and reassembled in different contexts and packages. Depending on the strength of a hash function used on a data block, a slightly modified data block may generate a different hash sketch. When a stronger hash function is used, a slightly modified data block will generate a hash sketch different than the original data block. However, the different hash sketch will not be indexed and stored by a standard deduplication process, which is generally determined by the indication of a duplicate or identical match. If a weaker hash function is used on a slightly modified data block, the sketch of the modified block may be the same as the sketch of the pre-modified data block. The weaker hash sketches may include e.g. several Rabin fingerprints and have the property that if two data blocks share the same sketch, then the two data blocks contain a lot of the same content, i.e. the two data blocks are likely near-duplicate.
- In similarity based deduplication using delta compression, a new incoming block is compared to a list of reference data blocks to identify a related reference data block by comparing their sketches. If a related reference data block is identified among the list of reference data blocks, a delta compression of the new incoming block is performed against the identified related reference data block and only the delta is stored along with a pointer to the identified related reference data block. By deriving the differences between near-duplicate data blocks, delta compression can effectively deduplicate data at both the file and block levels. The central tenet of delta compression is to find the difference between two similar data blocks or chunks and try to retain only one of the two blocks in storage. The difference between the stored block and the remaining block along with a reference to the stored block is stored for the remaining block. Delta compression techniques offer deduplication benefit gains of 1.4 times compared to conventional deduplication techniques. However, improvements to the throughput of the system may be achieved through a hardware embodiment making the similarity based deduplication techniques described in the present disclosure more applicable to primary storage or storage caches (e.g., providing approximately one gigabyte per second throughput and sub-millisecond latency).
-
FIG. 1 is a high-level block diagram illustrating anexample system 100 including a storage controller having a delta compression engine. Thesystem 100 includes one or more clients 102 a . . . 102 n, anetwork 104, and a storage system includingstorage controller 106 andstorage devices 108 a . . . n. Thestorage controller 106 includesdelta compression engine 110. - The client devices 102 a . . . 102 n can be any computing device including one or more memory and one or more processors, for example, a laptop computer, a desktop computer, a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile email device, a portable game player, a portable music player, a television with one or more processors embedded therein or coupled thereto or any other electronic device capable of making storage requests. A client device 102 may execute an application that makes storage requests (e.g., read, write, etc.) to the storage devices 108. While the example of
FIG. 1 includes two clients, 102 a and 102 n, it should be understood that any number of clients 102 may be present in the system. Clients (e.g., client 102 a) may be directly coupled with storage sub-systems including individual storage devices (e.g.,storage device 108 a) viastorage controller 106. Optionally, clients may be indirectly coupled with storage sub-systems including individual storage devices 108 via a separate controller. - In some embodiments, the
system 100 includes astorage controller 106 that provides a single interface for the client devices 102 to access the storage devices 112 in the storage system. Thestorage controller 106 may be a computing device configured to make some or all of the storage space on disks 108 available to clients 102. As depicted in theexample system 100, client devices can be coupled to thestorage controller 106 via network 104 (e.g., client 102 a) or directly (e.g.,client 102 n). - The
network 104 can be one of a conventional type, wired or wireless, and may have numerous different configurations including a star configuration, token ring configuration, or other configurations. Furthermore, thenetwork 104 may include a local area network (LAN), a wide area network (WAN) (e.g., the internet), and/or other interconnected data paths across which multiple devices (e.g.,storage controller 106, client device 102, etc.) may communicate. In some embodiments, thenetwork 104 may be a peer-to-peer network. Thenetwork 104 may also be coupled with or include portions of a telecommunications network for sending data using a variety of different communication protocols. In some embodiments, thenetwork 104 may include Bluetooth (or Bluetooth low energy) communication networks or a cellular communications network for sending and receiving data including via short messaging service (SMS), multimedia messaging service (MMS), hypertext transfer protocol (HTTP), direct data connection, WAP, email, etc. Although the example ofFIG. 1 illustrates onenetwork 104, in practice one ormore networks 104 can connect the entities of thesystem 100. -
FIG. 2 is a block diagram illustrating anexample system 200 configured to implement the techniques introduced herein. In one embodiment, thesystem 200 may be a client device 102. In other embodiments, thesystem 200 may bestorage controller 106. In yet further embodiments, thesystem 200 may be implemented as a combination of a client device andstorage controller 106. - The
system 200 includes a network interface (IF)module 202, aprocessor 204, amemory 206, a storage interface (IF)module 208, adelta compression engine 110, and astorage device 216.Delta compression engine 110 includesblock signature module 210, a referenceblock index module 212, and adelta encoding module 214. The components of thesystem 200 are communicatively coupled to a bus orsoftware communication mechanism 220 for communication with each other. - In some embodiments,
software communication mechanism 220 may be an object bus (e.g., CORBA), direct socket communication (e.g., TCP/IP sockets) among software modules, remote procedure calls, UDP broadcasts and receipts, HTTP connections, function or procedure calls, etc. Further, any or all of the communication could be secure (SSH, HTTPS, etc.). Thesoftware communication mechanism 220 can be implemented on any underlying hardware, for example, a network, the Internet, a bus, a combination thereof, etc. - The network interface (I/F)
module 202 is configured to connectsystem 200 to a network and/or other system (e.g., network 104). For example,network interface module 202 may enable communication through one or more of the internet, cable networks, and wired networks. Thenetwork interface module 202 links theprocessor 204 to thenetwork 104 that may in turn be coupled to other processing systems (e.g., a server). Thenetwork interface module 202 also provides other conventional connections to thenetwork 104 for distribution and/or retrieval of files and/or media objects using standard network protocols such as TCP/IP, HTTP, HTTPS and SMTP as will be understood. In some embodiments, thenetwork interface module 202 includes a transceiver for sending and receiving signals using WiFi, Bluetooth® or cellular communications for wireless communication. - The
processor 204 may include an arithmetic logic unit, a microprocessor, a general purpose controller or some other processor array to perform computations and provide electronic display signals to a display device. In some embodiments, theprocessor 204 is a hardware processor having one or more processing cores. Theprocessor 204 is coupled to thebus 220 for communication with the other components.Processor 204 processes data signals and may include various computing architectures including a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. Although only a single processor is shown in the example ofFIG. 2 , multiple processors and/or processing cores may be included. It should be understood that other processor configurations are possible. - The
memory 206 stores instructions and/or data that may be executed by theprocessor 204. Thememory 206 is coupled to thebus 220 for communication with the other components of thesystem 200. The instructions and/or data stored in thememory 206 may include code for performing any and/or all of the techniques described herein. Thememory 206 may be, for example, non-transitory memory such as a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory or some other memory devices. In some embodiments, thememory 206 also includes a non-volatile memory or similar permanent storage device and media, for example, a hard disk drive, a floppy disk drive, a compact disc read only memory (CD-ROM) device, a digital versatile disc read only memory (DVD-ROM) device, a digital versatile disc random access memories (DVD-RAM) device, a digital versatile disc rewritable (DVD-RW) device, a flash memory device, or some other non-volatile storage device. - The storage interface (I/F)
module 208 accesses information requested by the clients 102. The information may be stored on any type of attached array of writable storage media, such as magnetic disk or tape, optical disk (e.g., CD-ROM or DVD), flash memory, solid-state drive (SSD), electronic random access memory (RAM), micro-electro mechanical and/or any other similar media adapted to store information, including data and parity information. However, as illustratively described herein, the information is stored on disks 108. The storage I/F module 208 includes a plurality of ports having input/output (I/O) interface circuitry that couples with the disks over an I/O interconnect arrangement. - In some embodiments, the
delta compression engine 110 ofsystem 200 may be configured to compress data for storage or transfer based on a delta compression similarity based data deduplication technique in accordance with the present disclosure.Delta compression engine 110 may includeblock signature module 210, referenceblock index module 212, anddelta encoding module 214. In one embodiment, theblock signature module 210 may be configured to compute signature sketches for data blocks based on a fingerprint computation. The signature sketches may be determined according to any generally known fingerprint computation. An exemplary fingerprint computation is described in accordance with the present disclosure. In one embodiment, theblock signature module 210 may be configured to determine the signature sketches of new incoming data blocks based on a fingerprint computation. In another embodiment, the block signature module may be configured to determine the signature sketches of data blocks that will be stored in a reference list table or dictionary of reference data blocks. - In some embodiments, the reference
block index module 212 is in communication with the block signature module 210 to receive signature sketches determined by the block signature module 210. The reference block index module 212 may be configured to generate and search a reference index and reference dictionary using a determined block signature sketch, according to techniques disclosed herein, in order to identify related reference data blocks that may be used as a basis for delta compression. The reference block index module 212 may access, store, generate, and/or manage a reference index containing reference fingerprints or signature sketches (computed by the block signature module 210) against which new incoming fingerprints may be compared. The reference block index module 212 may be configured to compare a newly generated fingerprint to indexed fingerprints to identify a similar reference data block.
- In some embodiments, the
delta encoding module 214 compares an incoming data block corresponding with the newly generated fingerprint to a related reference data block stored among reference data blocks. For example, the delta encoding module 214 scans the incoming data block and the reference data block to determine a match between one or more data elements of the data blocks. The delta encoding module 214 encodes the new data block using matching data elements between the new data block and the reference data block to produce a compressed delta.
- The
block signature module 210, the reference block index module 212, and the delta encoding module 214 may be implemented in hardware, e.g., on a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), or the like. For example, the modules 210, 212, and 214 may be implemented on a V6-240T FPGA, or the like, and act as a co-processor in system 200. While depicted in FIG. 2 as distinct modules, it should be understood that one or more of the modules 210, 212, and/or 214 may be implemented on the same hardware device or various hardware devices.
FIG. 3 illustrates a block diagram of an example hardware architecture and logical flow of data through the delta compression engine in accordance with the present disclosure. Reference sketches 310 are loaded into dictionary 318. Dictionary 318 is a reference list table built up of reference data blocks associated with their fingerprint sketches (e.g., reference sketches 310). Dictionary 318 may be stored in random-access memory (RAM). Fast index 316 is a hash index table. A hash function 314 is performed on each reference sketch and a hash index table is built up of hash key records, where each record forms a pair composed of a hash key and an index to the reference list. The hash index table may be stored in RAM. After the fast index 316 and dictionary 318 are built up, a new sketch 312 is received and a hash function 314 is performed on the new sketch 312. A hash key of the new sketch 312 is used to search fast index 316 for a similar hash key of a related reference sketch in one of the hash index records of fast index 316. If a matching hash key is found, the hash key record including an index to the reference list is used to locate a related reference sketch and its corresponding related reference data block in the dictionary 318. After a bus system delay 320 to account for the hash function 314 and index search on the new sketch 312, the new data block corresponding to new sketch 312 is compared at 322 to the related reference data block corresponding to the related reference sketch. While scanning the new data block and the related reference data block, a flag 323 is set based on a determination of a match between one or more data elements of the new data block and one or more data elements of the related reference data block. The new data block is delta compressed against the related reference data block and stored according to an encoding scheme using the match.
- In one embodiment, a
reference sketch 310 and a new sketch 312 are received by the delta compression engine. A sketch may be used to represent each data block and keep track of I/O access patterns to all sketches. The reference block index module 212 may be configured to generate dictionary 318 by storing reference data blocks and their sketches in a reference list. For example, based on content locality, access frequency, and/or recency of data contents, some of the most popular data blocks are selected and cached in dictionary 318 as reference data blocks in a reference list. A newly generated block sketch, e.g., new sketch 312, is used as a key to search the reference list of dictionary 318 to find a related reference data block in the reference list. The new data block corresponding to the new sketch 312 is compared to the related reference data block and then delta compressed against the related reference data block to produce a compressed delta. The compressed delta and a pointer to the related reference data block are stored in primary storage or cache.
- In one embodiment, a sketch contains 8 fingerprints, each of which is one byte long. If a new data block and a reference data block have n matching fingerprints between their respective sketches (n from 4 to 8), the two data blocks are considered near duplicate blocks. n is referred to as a similarity threshold. Once a near duplicate block is found in the reference index, i.e., fast index 316, using a
hash 314 of the new sketch 312 as key, the corresponding reference data block will be read out of the dictionary and delta compression will be performed against it.
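The near-duplicate test described above can be illustrated with a minimal Python sketch. It assumes that fingerprints are compared as a multiset, regardless of their position in the sketch, which the text does not specify. The function names `matching_fingerprints` and `is_near_duplicate` are hypothetical.

```python
from collections import Counter

SKETCH_LEN = 8  # eight one-byte fingerprints per sketch, as in the embodiment above


def matching_fingerprints(sketch_a: bytes, sketch_b: bytes) -> int:
    """Count fingerprints shared by two sketches.

    Assumption: matches are counted as a multiset intersection, so the
    position of a fingerprint inside the sketch does not matter.
    """
    if len(sketch_a) != SKETCH_LEN or len(sketch_b) != SKETCH_LEN:
        raise ValueError("each sketch must hold exactly 8 one-byte fingerprints")
    common = Counter(sketch_a) & Counter(sketch_b)
    return sum(common.values())


def is_near_duplicate(new_sketch: bytes, ref_sketch: bytes, threshold: int = 4) -> bool:
    """Return True when at least `threshold` fingerprints match (n from 4 to 8)."""
    if not 4 <= threshold <= SKETCH_LEN:
        raise ValueError("similarity threshold n is expected to be between 4 and 8")
    return matching_fingerprints(new_sketch, ref_sketch) >= threshold
```

With a reference sketch `bytes([1, 2, 3, 4, 5, 6, 7, 8])` and a new sketch `bytes([1, 2, 3, 4, 9, 10, 11, 12])`, four fingerprints match, so the blocks count as near duplicates at n=4 but not at n=5.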
FIG. 4 illustrates a two parallel pipeline structure design of the delta compression engine which may be employed according to the present disclosure. As seen in FIG. 4, one pipeline, e.g., reference pipeline 410, is used to build the dictionary using the reference data block while the other pipeline, e.g., compression pipeline 420, scans an incoming data block to be compressed.
- In one embodiment, the
reference pipeline 410 processes reference data blocks (e.g., data blocks determined to be frequently or recently accessed) to load the reference data blocks into the dictionary 318. For example, at 412, portions of the reference data block (e.g., 8 byte portions shifted 1 byte at a time) are hashed into a hash value that is used to search for a matching string in dictionary 318. To avoid a linear search of the dictionary 318, another block RAM may be used to build a fast index 316.
- The
compression pipeline 420 processes an incoming new data block such that a quick search for repeated strings may be performed through the fast search structure. For example, an incoming new data block is hashed into a hash value that is used as a key to search at 422 for a related reference data block in dictionary 318. In some embodiments, a bitwise comparison may be performed to confirm a bit-by-bit match of the two strings. Once a match is found at 424, a sequential search at 423 is performed to maximize the match length. The search results are then encoded at 428.
- In one embodiment, a sequential search may be performed by an address prediction technique in order to optimize the length of the matched data string and maximize the compression ratio. Using the address prediction technique, when a match is found, the
delta encoding module 214 will predict that the next matching dictionary index location is the location directly after the current location, and will not search the dictionary by the hash key value for the next match.
- The compression hardware of the present disclosure is further optimized to achieve wire speed compression by the design of a parallel delta compression encoding structure as seen in
FIG. 11. Generally, string matching is done for every 8 byte data chunk where subsequent data chunks in a data block are shifted by just one byte at a time. In one embodiment, the bus width is 8 bytes, so the data transfer speed of the bus may be faster than one delta compression engine. Therefore, some embodiments include eight compression channels working in parallel to achieve wire speed. In one embodiment, each channel stores and encodes one data chunk.
FIG. 5 is a flow chart of an example method for delta compression encoding a new reference data block. At 502, reference data blocks are loaded into dictionary 318 and sketches are generated for the loaded reference data blocks. As described above, a sketch of a reference data block is generated by the block signature module 210 creating a group of fingerprints characterizing the data of the reference data block. In one embodiment, the reference data blocks are chosen based on how frequently and/or recently the data blocks have been accessed. At 504, the reference block index module 212 identifies a reference data block related to an incoming new data block using a sketch of the new data block as a key to the dictionary 318. In some embodiments, the reference block index module 212 further uses a fast hash index 316 as described above. At 506, the new data block and the identified related reference data block are fed into delta encoding module 214. At 508, the delta encoding module 214 scans the related reference data block and the new data block for repetitive or matching data strings or data elements. At 510, if the delta encoding module 214 finds a match between one or more data elements of the new data block and the related reference data block, the matched data elements of the new data block are encoded 512 according to the encoding output structure for matched data elements as described herein. If, at 510, the delta encoding module 214 does not find a match between one or more data elements or data string of the new data block, the non-matched data elements or data string is encoded 514 according to the encoding output structure for non-matched data elements as described herein. After encoding matched 512 or non-matched 514 data elements or data strings, the delta encoding module 214 determines at 516 whether the end of the new data block has been reached. If so, the process returns to 504, where a new data block will be encoded.
If, at 516, the end of the new data block has not been reached, the method continues at 508 to scan the new data block and the related reference data block for matching data elements or data strings in order to encode the remaining data elements of the new data block.
FIG. 6 illustrates an example of delta compression encoding according to the techniques disclosed herein. Throughout the description of FIG. 6, Blkref is used to refer to a related reference data block and Blknew is used to refer to a new data block to be compressed using the related reference data block. As described above, the related reference data block is loaded into the dictionary prior to receiving the new data block for compression. As described above, the delta encoding module 214 compares the two data blocks to determine repetitions between the two data blocks. The encoded data includes a number of fields to identify matched or non-matched data elements and locations where the data elements can be found on storage media. For example, the fields may include an offset, a flag, an index, and a length. The offset field indicates the position of a data element in the new data block or the related reference data block. For example, when data elements in the new data block and the related reference data block match, the offset field indicates the ending position of the matched one or more data elements in the new data block. Similarly, when a data element in the new data block does not match, the offset field indicates the position of the data element in the new data block that did not match a data element in the related reference data block. The flag field indicates whether a data element in the new data block has a match in the related reference data block. For example, the flag field may be set to 1 if a match is found in the related reference data block for a data element of the new data block and may be set to 0 if no match is found. The index field indicates the starting position of the matched string in the related reference data block. The length field indicates the total length of the matched string.
The miss field indicates the data elements from the new data block which do not appear in the related reference data block (e.g., when the flag field is set to 0). For example, the miss field may store a physical or logical address for the data elements stored to a storage device.
- As illustrated in the example of
FIG. 6, data elements 0 and 1 (Dw1 and Dw0) of new data block Blknew match data elements 7 and 8 (Dw1 and Dw0) of the related reference data block Blkref. The fields of the encoded data are set to indicate the ending position, in the new data block, of the data elements that match the related reference data block (e.g., offset=1), whether a match is found (e.g., flag=1), the starting position of the matched data in the related reference data block (e.g., index=7), and the length of the matching data elements in the related reference data block Blkref (e.g., length=2). Thus, the output for the above described match may be encoded as (1,1,7,2) with a reference to the related reference data block, as shown in the example of FIG. 6. Similarly, the example encoding of FIG. 6 shows that data element 3 (e.g., Dw4) in Blknew has no match in Blkref. Therefore, the fields of the encoded data indicate the position of the data element in the new data block (e.g., offset=3), indicate that it does not have a match (e.g., flag=0), and include a reference to the unique data (e.g., Dw4) stored on a storage device. As shown in the example of FIG. 6, the output may be encoded as (3,0, Dw4).
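The field layout described for FIG. 6 can be illustrated with a minimal Python sketch. It is not the hardware encoder: it uses a plain linear search with greedy longest-match selection, and the miss field carries the literal data element rather than a storage address. Both are assumptions. The function names and all block contents other than Dw0, Dw1, and Dw4 are hypothetical.

```python
def delta_encode(new_block, ref_block):
    """Encode new_block against ref_block.

    Emits (offset, 1, index, length) for a matched run, where offset is the
    ending position of the run in the new block, and (offset, 0, miss) for an
    element with no match. Greedy longest match is an assumption.
    """
    records = []
    pos = 0
    while pos < len(new_block):
        best_index, best_len = -1, 0
        for j, element in enumerate(ref_block):
            if element != new_block[pos]:
                continue
            # Extend the match sequentially to maximize its length.
            length = 0
            while (pos + length < len(new_block)
                   and j + length < len(ref_block)
                   and new_block[pos + length] == ref_block[j + length]):
                length += 1
            if length > best_len:
                best_index, best_len = j, length
        if best_len:
            records.append((pos + best_len - 1, 1, best_index, best_len))
            pos += best_len
        else:
            records.append((pos, 0, new_block[pos]))
            pos += 1
    return records


def delta_decode(records, ref_block):
    """Rebuild the new block: copy from the reference on flag=1, else take the miss data."""
    out = []
    for record in records:
        if record[1] == 1:
            _, _, index, length = record
            out.extend(ref_block[index:index + length])
        else:
            out.append(record[2])
    return out


# Hypothetical blocks: only positions 7 and 8 of blk_ref and the Dw* elements follow FIG. 6.
blk_ref = ["Da", "Db", "Dc", "Dd", "De", "Df", "Dg", "Dw1", "Dw0"]
blk_new = ["Dw1", "Dw0", "Dc", "Dw4"]
encoded = delta_encode(blk_new, blk_ref)
```

For these inputs, `encoded` is `[(1, 1, 7, 2), (2, 1, 2, 1), (3, 0, "Dw4")]`. The first and last records correspond to the (1,1,7,2) and (3,0, Dw4) outputs discussed above, and `delta_decode(encoded, blk_ref)` returns the original new block.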
Algorithm 1 below shows the process for single dictionary encoding.

Algorithm 1: Single dictionary encoding
    if reference block then
        for i = block size-7 to 0 do
            Dictionary[i] = Blkref[i, i+1, ..., i+7]
            Hash table[hash_func(Blkref[i, i+1, ..., i+7])] = i
        end for
    else
        for i = block size/8 to 0 do
            Hash_index = Hash table[hash_func(Blknew[i×8, ..., i×8+7])]
            String match with Dictionary[Hash_index]
            Encoding
        end for
    end if
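A minimal runnable Python sketch of Algorithm 1 follows. Several details are assumptions rather than the disclosed design: `hash_func` is a stand-in (the integer value of the window modulo the prime 65521), the new block is scanned in ascending chunk order, and the output records reuse the (offset, flag, index, length) field order with the chunk number as the offset and the literal chunk as miss data.

```python
CHUNK = 8  # bytes per dictionary entry and per lookup


def hash_func(window: bytes) -> int:
    """Stand-in hash; the source does not specify one."""
    return int.from_bytes(window, "big") % 65521


def build_dictionary(blk_ref: bytes):
    """Reference branch: store every 8-byte window, shifted one byte at a time."""
    dictionary, hash_table = {}, {}
    # Iterate downward as in the pseudocode, so the earliest window wins a hash collision.
    for i in range(len(blk_ref) - CHUNK, -1, -1):
        window = blk_ref[i:i + CHUNK]
        dictionary[i] = window
        hash_table[hash_func(window)] = i
    return dictionary, hash_table


def encode_block(blk_new: bytes, dictionary, hash_table):
    """New-block branch: look up each 8-byte chunk and confirm it by string match."""
    records = []
    for i in range(len(blk_new) // CHUNK):
        chunk = blk_new[i * CHUNK:(i + 1) * CHUNK]
        hash_index = hash_table.get(hash_func(chunk))
        if hash_index is not None and dictionary[hash_index] == chunk:
            records.append((i, 1, hash_index, CHUNK))  # match: index into the reference
        else:
            records.append((i, 0, chunk))              # miss: carry the literal chunk
    return records


blk_ref = bytes(range(64))
blk_new = blk_ref[16:24] + b"ABCDEFGH" + blk_ref[3:11]
dictionary, hash_table = build_dictionary(blk_ref)
records = encode_block(blk_new, dictionary, hash_table)
```

Here `records` is `[(0, 1, 16, 8), (1, 0, b"ABCDEFGH"), (2, 1, 3, 8)]`. The explicit comparison against `dictionary[hash_index]` plays the role of the string match step, so a hash collision degrades to a miss instead of producing a wrong match.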
For single dictionary encoding, a line speed of 8 byte encoding is possible.
- In some embodiments, both reference data block dictionary updating and new data block delta encoding can be processed at line speed by parallel computation in hardware design.
Algorithm 2 below shows the process for multiple dictionary encoding where a single large dictionary may be split into 8 smaller dictionaries such that multiple dictionaries may perform parallel store and search.

Algorithm 2: Multiple dictionary encoding
    if reference block then
        for m = 8 to 0 do
            for i = block size/8 to 0 do
                Dictionary[m][i] = Blkref[i+m, ..., i+m+7]
                Hash table[m][hash_func(Blkref[i+m, ..., i+m+7])] = i
            end for
        end for
    else
        for m = 8 to 0 do
            for i = block size/8 to 0 do
                Hash_index[m] = Hash table[hash_func(Blknew[i×8, ..., i×8+7])]
                String match with Dictionary[m][Hash_index[m]]
                Encoding
            end for
        end for
    end if
FIG. 7 illustrates a block diagram of a hardware decompression logic architecture. Based on the value of flag 703, a multiplexor (MUX) 720 selects either the value from dictionary 718 or miss 704 and sends the selected value to decompression FIFO 730 for recovery of the delta compressed data. In one embodiment, the dictionary 718 or miss 704 stores a reference to data stored elsewhere and provides the reference to the decompression FIFO 730. The value of flag 703 is determined by whether a string in a delta compressed data block has a match in a related reference data block. If there is a match (e.g., flag 703 holds the value 1), index 701 and length 702 are used to produce the data stream or corresponding data elements from the dictionary 718. If there is no match (e.g., flag 703 holds the value 0), the MUX 720 will forward the input from miss data 704 to the decompression FIFO to retrieve the data for the delta compressed data block. The value of miss data 704 refers to the value of the data element in a delta compressed data block that did not have a match to a data element in a related reference data block.
- In some embodiments, data block sketches,
e.g., reference sketch 310 and new sketch 312, are derived by a Rabin fingerprint calculation for every fixed-size sliding window (e.g., 8 bytes long). In some embodiments, the block signature module 210 processes multiple bits in one clock cycle to provide fingerprinting for high data rate applications. Using formal algebra, a single modulo operation (e.g., determining a Rabin fingerprint) can be turned into multiple calculations, each of which is responsible for one bit in the result. In the following examples, we assume the data string is 64 bits, resulting in 16-bit Rabin fingerprints.
- In one embodiment, to implement one of these equations in hardware, a combinatorial circuit may be used to compute an exclusive-OR (XOR) of all of the corresponding input bits. The combination of these 16 circuits is referred to herein as a Fresh function.
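The idea that each bit of the fingerprint is an XOR of selected input bits can be sketched in Python. The modulus `P = 0x1002D` (x^16 + x^5 + x^3 + x^2 + 1) is a hypothetical choice that does not come from the source, and `fresh` is only an illustration of the kind of computation the text calls a Fresh function, not the disclosed circuit.

```python
P = 0x1002D          # hypothetical degree-16 modulus, not taken from the source
IN_BITS, OUT_BITS = 64, 16


def poly_mod(a: int, p: int = P) -> int:
    """Reduce polynomial a (bits are GF(2) coefficients) modulo p."""
    dp = p.bit_length() - 1
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a


# Residue of X^i mod P for every input bit position i.
COLUMNS = [poly_mod(1 << i) for i in range(IN_BITS)]

# MASKS[j] selects the input bits whose XOR gives bit j of the fingerprint.
MASKS = [
    sum(1 << i for i in range(IN_BITS) if (COLUMNS[i] >> j) & 1)
    for j in range(OUT_BITS)
]


def parity(x: int) -> int:
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def fresh(data: int) -> int:
    """Compute the 16 fingerprint bits, each as an XOR of selected input bits."""
    fp = 0
    for j, mask in enumerate(MASKS):
        fp |= parity(data & mask) << j
    return fp
```

Because reduction modulo P is linear over GF(2), `fresh(v)` equals `poly_mod(v)` for every 64-bit value `v`. In hardware, each mask would correspond to one XOR tree, and the 16 trees could evaluate in parallel.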
- For applications of higher data rate, Rabin fingerprint computations are applied to all “shingles.” An example of these shingles is shown in
FIG. 8. FIG. 8 depicts shingles in a data stream from α0 to α71, where A(X) is the first shingle and B(X) is the second shingle. While the example of FIG. 8 depicts a shift of one byte, shingles can shift in various other multiples of bits. In one embodiment, to treat all of the shingles in real-time, the Fresh function may be replicated over each shingle. However, it is evident that overlapping computations occur in this scheme. The relation between the Rabin fingerprints of A and B can be calculated as:
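$$
\begin{aligned}
B \bmod P &= \left(V + W \cdot X^{56}\right) \bmod P \\
&= \left((U - U) \cdot \left(X^{-8} \bmod P\right) + V + W \cdot X^{56}\right) \bmod P \\
&= \left(-U \cdot \left(X^{-8} \bmod P\right)\right) \bmod P + \left(\left(X^{-8} \bmod P\right) \cdot \left(U + V \cdot X^{8}\right)\right) \bmod P + \left(W \cdot X^{56}\right) \bmod P \\
&= \left(W \cdot X^{56} - U \cdot \left(X^{-8} \bmod P\right)\right) \bmod P + \left(\left(X^{-8} \bmod P\right) \cdot \left(U + V \cdot X^{8}\right)\right) \bmod P \\
&= \left(W \cdot X^{56} - U \cdot \left(X^{-8} \bmod P\right)\right) \bmod P + \left(\left(X^{-8} \bmod P\right) \cdot \left(\left(U + V \cdot X^{8}\right) \bmod P\right)\right) \bmod P
\end{aligned}
$$

Let $x^{-8} = X^{-8} \bmod P$. Then

$$
B \bmod P = \left(W \cdot X^{56} - U \cdot x^{-8}\right) \bmod P + \left(x^{-8} \cdot (A \bmod P)\right) \bmod P
$$

- As can be seen, the fingerprint of the new shingle B(x) is dependent on the fingerprint of the old shingle A(x), the first byte of the old shingle U(x), and the first byte of incoming data W(x), which is the last byte of the new shingle B(x). Thus, the fingerprint calculation of each shingle can be optimized using the fingerprint calculation of the previous shingle.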
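The incremental relation can be checked numerically with a short Python sketch. It assumes, following the terms of the derivation, that A = U + V·X^8, so the first byte of a shingle holds the lowest-order coefficients. The modulus `P = 0x1002D` is a hypothetical choice that does not come from the source. The sketch relies only on P having a constant term of 1, which makes X invertible modulo P. Subtraction over GF(2) is XOR.

```python
P = 0x1002D  # hypothetical degree-16 modulus with constant term 1, not from the source


def poly_mod(a: int, p: int = P) -> int:
    """Reduce polynomial a (bits are GF(2) coefficients) modulo p."""
    dp = p.bit_length() - 1
    while a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a


def poly_mulmod(a: int, b: int, p: int = P) -> int:
    """Carry-less multiplication of a and b, reduced modulo p."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return poly_mod(r, p)


# P = 1 + X*Q implies X*Q = 1 (mod P), so X^-1 is Q = (P ^ 1) >> 1.
X_INV = (P ^ 1) >> 1
X_INV8 = 1
for _ in range(8):
    X_INV8 = poly_mulmod(X_INV8, X_INV)  # x^-8 = X^-8 mod P


def rabin_full(shingle: bytes) -> int:
    """Full fingerprint of an 8-byte shingle; the first byte is the lowest-order term."""
    return poly_mod(int.from_bytes(shingle, "little"))


def rabin_slide(fp_a: int, u: int, w: int) -> int:
    """B mod P from A mod P, the outgoing byte U, and the incoming byte W."""
    return poly_mod((w << 56) ^ poly_mulmod(u, X_INV8)) ^ poly_mulmod(X_INV8, fp_a)


def fingerprints(data: bytes) -> list:
    """One full computation followed by incremental updates, one per byte shift."""
    fp = rabin_full(data[:8])
    out = [fp]
    for i in range(1, len(data) - 7):
        fp = rabin_slide(fp, data[i - 1], data[i + 7])
        out.append(fp)
    return out
```

For any input of at least 8 bytes, `fingerprints(data)` matches the list obtained by calling `rabin_full` on every 8-byte window, while each update after the first touches only the outgoing byte, the incoming byte, and the previous fingerprint.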
- Using a 64-bit wide data bus and a 64-bit shingle as an example, an incremental computation pipeline design is illustrated in
FIG. 9. The data is drawn from two consecutive clock cycles, for example (α0, α1, ..., α63) from the preceding cycle and (α64, α65, ..., α127) from the following cycle.
- In some embodiments, the techniques disclosed herein include finding an irreducible polynomial for which Rabin fingerprint computation has the least number of operations for one full computation and several incremental computations of a multiple byte data shingle to group the data in a stream (e.g., seven incremental computations for an eight byte data shingle). The techniques further include computing a Rabin fingerprint incrementally using the selected irreducible polynomial. For example, incremental computation may allow computation of a fingerprint to reuse calculation results from a previous fingerprint calculation of eight bytes. As an example, the fingerprint calculation may calculate the fingerprint of all eight bytes numbered zero to seven, and may shift one byte to the right for a next clock cycle. On the next clock cycle, the calculations for bytes zero to seven may be reused and only the calculations involving byte eight and byte zero may be performed. Thus, the fingerprint for the shingle of bytes one to eight may be computed incrementally, reusing the calculations of the prior fingerprint for eight bytes and performing new calculations.
-
FIG. 10 is a block diagram illustrating an example block signature module 210. The example block signature module 210 includes a fingerprint pipeline 1002, a number of sampling modules 1004a-1004n, and a fingerprint selection module 1006. In the example single pipeline design depicted in FIG. 10, data 1008 flows from top to bottom through the fingerprint pipeline. The total number of fingerprints generated for a w-byte data chunk according to the techniques disclosed here is w−b+1, where b is the size of the shingles. In some embodiments, to reduce the number of fingerprints compared by the deduplication modules, several fingerprints may be chosen from among all of the fingerprints as a sketch to represent the data chunk. In one embodiment, fingerprints with upper N bits having a specific pattern are selected for the sketch since these upper bits in each fingerprint can be considered as randomly distributed. This selection balances processing speed, similarity detection, elimination of false positives, and resolution.
- Fingerprint results produced at every pipeline stage are sent to the right for the corresponding channel sampling modules to process. As the data chunk runs through the pipeline, the fingerprints are sampled and stored in an intermediate buffer. After the sampling for a data chunk is done, the fingerprint selection module chooses from the intermediate samples and returns a sketch for the data block. In some embodiments, the pipeline is composed of one Fresh function and several following Shift functions.
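The sampling and selection steps can be sketched in Python under stated assumptions. The text does not say how the fingerprint selection module picks from the intermediate samples. Keeping the smallest distinct samples is a placeholder rule chosen here for determinism, and the function names are hypothetical.

```python
def number_of_fingerprints(w: int, b: int) -> int:
    """Fingerprints produced for a w-byte chunk with b-byte shingles: w - b + 1."""
    return w - b + 1


def sample(fingerprints, n_bits: int, pattern: int, width: int = 16):
    """Keep the fingerprints whose upper n_bits equal the given pattern."""
    shift = width - n_bits
    return [fp for fp in fingerprints if fp >> shift == pattern]


def select_sketch(samples, size: int = 8):
    """Choose up to `size` samples as the sketch.

    Assumption: the smallest distinct samples are kept; the source leaves
    the selection rule open.
    """
    return tuple(sorted(set(samples))[:size])


fps = [0xA123, 0x1234, 0xA001, 0xAFFF, 0xB000]
samples = sample(fps, n_bits=4, pattern=0xA)
sketch = select_sketch(samples, size=2)
```

Here `samples` is `[0xA123, 0xA001, 0xAFFF]` and `sketch` is `(0xA001, 0xA123)`. For a 4096-byte chunk with 8-byte shingles, `number_of_fingerprints(4096, 8)` gives 4089 candidate fingerprints before sampling.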
- Systems and methods for implementing a hardware architecture of a delta compression engine for similarity based data deduplication are described above. In the above description, for purposes of explanation, numerous specific details were set forth. It will be apparent, however, that the disclosed technologies can be practiced without any given subset of these specific details. In other instances, structures and devices are shown in block diagram form. For example, the disclosed technologies are described in some embodiments above with reference to user interfaces and particular hardware. Moreover, the technologies are disclosed above primarily in the context of online services; however, the disclosed technologies apply to other data sources and other data types (e.g., collections of other resources, for example images, audio, web pages).
- Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosed technologies. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
- Some portions of the detailed descriptions above were presented in terms of processes and symbolic representations of operations on data bits within a computer memory. A process can generally be considered a self-consistent sequence of steps leading to a result. The steps may involve physical manipulations of physical quantities. These quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. These signals may be referred to as being in the form of bits, values, elements, symbols, characters, terms, numbers or the like.
- These and similar terms can be associated with the appropriate physical quantities and can be considered labels applied to these quantities. Unless specifically stated otherwise as apparent from the prior discussion, it is appreciated that throughout the description, discussions utilizing terms for example “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, may refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
- The disclosed technologies may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, for example, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, flash memories including USB keys with non-volatile memory or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
- The disclosed technologies can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In some embodiments, the technology is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
- Furthermore, the disclosed technologies can take the form of a computer program product accessible from a non-transitory computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
- A computing system or data processing system suitable for storing and/or executing program code will include at least one processor (e.g., a hardware processor) coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
- Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
- Finally, the processes and displays presented herein may not be inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the disclosed technologies were not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the technologies as described herein.
- The foregoing description of the embodiments of the present techniques and technologies has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the present techniques and technologies to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the present techniques and technologies be limited not by this detailed description. The present techniques and technologies may be implemented in other specific forms without departing from the spirit or essential characteristics thereof. Likewise, the particular naming and division of the modules, routines, features, attributes, methodologies and other aspects are not mandatory or significant, and the mechanisms that implement the present techniques and technologies or its features may have different names, divisions and/or formats. Furthermore, the modules, routines, features, attributes, methodologies and other aspects of the present technology can be implemented as software, hardware, firmware or any combination of the three. Also, wherever a component, an example of which is a module, is implemented as software, the component can be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future in computer programming. Additionally, the present techniques and technologies are in no way limited to embodiment in any specific programming language, or for any specific operating system or environment. Accordingly, the disclosure of the present techniques and technologies is intended to be illustrative, but not limiting.
Claims (23)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/214,243 US20170038978A1 (en) | 2015-08-05 | 2016-07-19 | Delta Compression Engine for Similarity Based Data Deduplication |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201562201493P | 2015-08-05 | 2015-08-05 | |
| US15/214,243 US20170038978A1 (en) | 2015-08-05 | 2016-07-19 | Delta Compression Engine for Similarity Based Data Deduplication |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20170038978A1 (en) | 2017-02-09 |
Family
ID=58053750
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/214,243 Abandoned US20170038978A1 (en) | 2015-08-05 | 2016-07-19 | Delta Compression Engine for Similarity Based Data Deduplication |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US20170038978A1 (en) |
Cited By (52)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20190095469A1 (en) * | 2017-09-28 | 2019-03-28 | Intel Corporation | Multiple order delta compression |
| US10282127B2 (en) | 2017-04-20 | 2019-05-07 | Western Digital Technologies, Inc. | Managing data in a storage system |
| US10789211B1 (en) * | 2017-10-04 | 2020-09-29 | Pure Storage, Inc. | Feature-based deduplication |
| US10904009B2 (en) | 2018-05-30 | 2021-01-26 | International Business Machines Corporation | Blockchain implementing delta storage |
| US11182359B2 (en) | 2020-01-10 | 2021-11-23 | International Business Machines Corporation | Data deduplication in data platforms |
| US20210397350A1 (en) * | 2019-06-17 | 2021-12-23 | Huawei Technologies Co., Ltd. | Data Processing Method and Apparatus, and Computer-Readable Storage Medium |
| US20220129184A1 (en) * | 2020-10-26 | 2022-04-28 | EMC IP Holding Company LLC | Data deduplication (dedup) management |
| CN114415955A (en) * | 2022-01-05 | 2022-04-29 | 上海交通大学 | Fingerprint-based block granularity data deduplication system and method |
| US11327741B2 (en) * | 2019-07-31 | 2022-05-10 | Sony Interactive Entertainment Inc. | Information processing apparatus |
| US20220147255A1 (en) * | 2019-07-22 | 2022-05-12 | Huawei Technologies Co., Ltd. | Method and apparatus for compressing data of storage system, device, and readable storage medium |
| US20220147256A1 (en) * | 2019-07-26 | 2022-05-12 | Huawei Technologies Co., Ltd. | Data Deduplication Method and Apparatus, and Computer Program Product |
| US11334268B2 (en) | 2020-01-10 | 2022-05-17 | International Business Machines Corporation | Data lineage and data provenance enhancement |
| US20220197527A1 (en) * | 2020-12-23 | 2022-06-23 | Hitachi, Ltd. | Storage system and method of data amount reduction in storage system |
| WO2022139626A1 (en) * | 2020-12-22 | 2022-06-30 | Huawei Technologies Co., Ltd. | Method for storing a data page in a data storage device using similarity based data reduction |
| WO2022159127A1 (en) * | 2021-01-22 | 2022-07-28 | EMC IP Holding Company LLC | Similarity deduplication |
| US20220253222A1 (en) * | 2019-11-01 | 2022-08-11 | Huawei Technologies Co., Ltd. | Data reduction method, apparatus, computing device, and storage medium |
| US20220317915A1 (en) * | 2021-04-06 | 2022-10-06 | EMC IP Holding Company LLC | Data expiration for stream storages |
| US20220342574A1 (en) * | 2021-04-23 | 2022-10-27 | EMC IP Holding Company LLC | Extending similarity-based deduplication to adjacent data |
| US20220382643A1 (en) * | 2009-05-22 | 2022-12-01 | Commvault Systems, Inc. | Block-level single instancing |
| US11550480B2 (en) * | 2019-07-08 | 2023-01-10 | Continental Teves Ag & Co. Ohg | Method of identifying errors in or manipulations of data or software stored in a device |
| US11599546B2 (en) | 2020-05-01 | 2023-03-07 | EMC IP Holding Company LLC | Stream browser for data streams |
| US11599293B2 (en) | 2020-10-14 | 2023-03-07 | EMC IP Holding Company LLC | Consistent data stream replication and reconstruction in a streaming data storage platform |
| US11599420B2 (en) | 2020-07-30 | 2023-03-07 | EMC IP Holding Company LLC | Ordered event stream event retention |
| US11604788B2 (en) | 2019-01-24 | 2023-03-14 | EMC IP Holding Company LLC | Storing a non-ordered associative array of pairs using an append-only storage medium |
| US11604759B2 (en) | 2020-05-01 | 2023-03-14 | EMC IP Holding Company LLC | Retention management for data streams |
| US20230089018A1 (en) * | 2021-09-23 | 2023-03-23 | EMC IP Holding Company LLC | Method or apparatus to integrate physical file verification and garbage collection (gc) by tracking special segments |
| US11681460B2 (en) | 2021-06-03 | 2023-06-20 | EMC IP Holding Company LLC | Scaling of an ordered event stream based on a writer group characteristic |
| US20230195351A1 (en) * | 2021-12-17 | 2023-06-22 | Samsung Electronics Co., Ltd. | Automatic deletion in a persistent storage device |
| US20230221864A1 (en) * | 2022-01-10 | 2023-07-13 | Vmware, Inc. | Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table |
| US20230229329A1 (en) * | 2022-01-20 | 2023-07-20 | Dell Products L.P. | Time-series data deduplication (dedup) caching |
| US11735282B2 (en) | 2021-07-22 | 2023-08-22 | EMC IP Holding Company LLC | Test data verification for an ordered event stream storage system |
| US11748307B2 (en) * | 2021-10-13 | 2023-09-05 | EMC IP Holding Company LLC | Selective data compression based on data similarity |
| US20230280922A1 (en) * | 2020-07-02 | 2023-09-07 | Intel Corporation | Methods and apparatus to deduplicate duplicate memory in a cloud computing environment |
| US11755207B2 (en) | 2019-12-18 | 2023-09-12 | Huawei Technologies Co., Ltd. | Data storage method in storage system and related device |
| US11755555B2 (en) | 2020-10-06 | 2023-09-12 | EMC IP Holding Company LLC | Storing an ordered associative array of pairs using an append-only storage medium |
| US11762715B2 (en) | 2020-09-30 | 2023-09-19 | EMC IP Holding Company LLC | Employing triggered retention in an ordered event stream storage system |
| US11789639B1 (en) * | 2022-07-20 | 2023-10-17 | Zhejiang Lab | Method and apparatus for screening TB-scale incremental data |
| US11816065B2 (en) | 2021-01-11 | 2023-11-14 | EMC IP Holding Company LLC | Event level retention management for data streams |
| US20230367477A1 (en) * | 2022-05-12 | 2023-11-16 | Hitachi, Ltd. | Storage system, data management program, and data management method |
| CN117216022A (en) * | 2023-11-07 | 2023-12-12 | 湖南省致诚工程咨询有限公司 | Digital engineering consultation data management system |
| US20230418497A1 (en) * | 2021-03-09 | 2023-12-28 | Huawei Technologies Co., Ltd. | Memory Controller and Method for Improved Deduplication |
| WO2024030040A1 (en) * | 2022-08-02 | 2024-02-08 | Huawei Technologies Co., Ltd | Method for date compression and related device |
| US11954537B2 (en) | 2021-04-22 | 2024-04-09 | EMC IP Holding Company LLC | Information-unit based scaling of an ordered event stream |
| US11971850B2 (en) | 2021-10-15 | 2024-04-30 | EMC IP Holding Company LLC | Demoted data retention via a tiered ordered event stream data storage system |
| US12001881B2 (en) | 2021-04-12 | 2024-06-04 | EMC IP Holding Company LLC | Event prioritization for an ordered event stream |
| US12007948B1 (en) * | 2022-07-31 | 2024-06-11 | Vast Data Ltd. | Similarity based compression |
| US12093190B2 (en) * | 2019-11-08 | 2024-09-17 | Nec Corporation | Recordation of data in accordance with data compression method and counting reading of the data in accordance with data counting method |
| US20240311013A1 (en) * | 2021-11-25 | 2024-09-19 | Huawei Technologies Co., Ltd. | Data storage system, intelligent network interface card, and compute node |
| US12099513B2 (en) | 2021-01-19 | 2024-09-24 | EMC IP Holding Company LLC | Ordered event stream event annulment in an ordered event stream storage system |
| US12271591B2 (en) * | 2020-07-09 | 2025-04-08 | Huawei Technologies Co., Ltd. | Data reduction method and apparatus |
| US12282673B2 (en) * | 2023-03-23 | 2025-04-22 | International Business Machines Corporation | Limiting deduplication search domains |
| US20250244900A1 (en) * | 2024-01-26 | 2025-07-31 | Samsung Electronics Co., Ltd. | Method for data deduplication of storage apparatus and storage apparatus |
Citations (25)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6374250B2 (en) * | 1997-02-03 | 2002-04-16 | International Business Machines Corporation | System and method for differential compression of data from a plurality of binary sources |
| US20030005235A1 (en) * | 2001-07-02 | 2003-01-02 | Sun Microsystems, Inc. | Computer storage systems |
| US6981114B1 (en) * | 2002-10-16 | 2005-12-27 | Veritas Operating Corporation | Snapshot reconstruction from an existing snapshot and one or more modification logs |
| US20060059207A1 (en) * | 2004-09-15 | 2006-03-16 | Diligent Technologies Corporation | Systems and methods for searching of storage data with reduced bandwidth requirements |
| US20060136365A1 (en) * | 2004-04-26 | 2006-06-22 | Storewiz Inc. | Method and system for compression of data for block mode access storage |
| US20060184505A1 (en) * | 2004-04-26 | 2006-08-17 | Storewiz, Inc. | Method and system for compression of files for storage and operation on compressed files |
| US20060218364A1 (en) * | 2005-03-24 | 2006-09-28 | Hitachi, Ltd. | Method and apparatus for monitoring the quantity of differential data in a storage system |
| US20070027867A1 (en) * | 2005-07-28 | 2007-02-01 | Nec Corporation | Pattern matching apparatus and method |
| US7472242B1 (en) * | 2006-02-14 | 2008-12-30 | Network Appliance, Inc. | Eliminating duplicate blocks during backup writes |
| US7523098B2 (en) * | 2004-09-15 | 2009-04-21 | International Business Machines Corporation | Systems and methods for efficient data searching, storage and reduction |
| US20090171990A1 (en) * | 2007-12-28 | 2009-07-02 | Naef Iii Frederick E | Apparatus and methods of identifying potentially similar content for data reduction |
| US20100125553A1 (en) * | 2008-11-14 | 2010-05-20 | Data Domain, Inc. | Delta compression after identity deduplication |
| US20100281208A1 (en) * | 2009-04-30 | 2010-11-04 | Qing Yang | System and Method for Data Storage |
| US20100318759A1 (en) * | 2009-06-15 | 2010-12-16 | Microsoft Corporation | Distributed rdc chunk store |
| US20120137059A1 (en) * | 2009-04-30 | 2012-05-31 | Velobit, Inc. | Content locality-based caching in a data storage system |
| US20130243190A1 (en) * | 2009-04-30 | 2013-09-19 | Velobit, Inc. | Optimizing signature computation and sampling for fast adaptive similarity detection based on algorithm-specific performance |
| US20140189279A1 (en) * | 2013-01-02 | 2014-07-03 | Man Keun Seo | Method of compressing data and device for performing the same |
| US20150010143A1 (en) * | 2009-04-30 | 2015-01-08 | HGST Netherlands B.V. | Systems and methods for signature computation in a content locality based cache |
| US8972672B1 (en) * | 2012-06-13 | 2015-03-03 | Emc Corporation | Method for cleaning a delta storage system |
| US9141301B1 (en) * | 2012-06-13 | 2015-09-22 | Emc Corporation | Method for cleaning a delta storage system |
| US20160182088A1 (en) * | 2014-12-19 | 2016-06-23 | Aalborg Universitet | Method For File Updating And Version Control For Linear Erasure Coded And Network Coded Storage |
| US9400610B1 (en) * | 2012-06-13 | 2016-07-26 | Emc Corporation | Method for cleaning a delta storage system |
| US20160328154A1 (en) * | 2014-02-26 | 2016-11-10 | Hitachi, Ltd. | Storage device, apparatus having storage device, and storage control method |
| US20160335024A1 (en) * | 2015-05-15 | 2016-11-17 | ScaleFlux | Assisting data deduplication through in-memory computation |
| US20170017650A1 (en) * | 2015-07-16 | 2017-01-19 | Quantum Metric, LLC | Document capture using client-based delta encoding with server |
- 2016-07-19: US US15/214,243, US20170038978A1 (en), not active (Abandoned)
Patent Citations (26)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6374250B2 (en) * | 1997-02-03 | 2002-04-16 | International Business Machines Corporation | System and method for differential compression of data from a plurality of binary sources |
| US20030005235A1 (en) * | 2001-07-02 | 2003-01-02 | Sun Microsystems, Inc. | Computer storage systems |
| US6981114B1 (en) * | 2002-10-16 | 2005-12-27 | Veritas Operating Corporation | Snapshot reconstruction from an existing snapshot and one or more modification logs |
| US20060184505A1 (en) * | 2004-04-26 | 2006-08-17 | Storewiz, Inc. | Method and system for compression of files for storage and operation on compressed files |
| US20060136365A1 (en) * | 2004-04-26 | 2006-06-22 | Storewiz Inc. | Method and system for compression of data for block mode access storage |
| US7523098B2 (en) * | 2004-09-15 | 2009-04-21 | International Business Machines Corporation | Systems and methods for efficient data searching, storage and reduction |
| US8725705B2 (en) * | 2004-09-15 | 2014-05-13 | International Business Machines Corporation | Systems and methods for searching of storage data with reduced bandwidth requirements |
| US20060059207A1 (en) * | 2004-09-15 | 2006-03-16 | Diligent Technologies Corporation | Systems and methods for searching of storage data with reduced bandwidth requirements |
| US20060218364A1 (en) * | 2005-03-24 | 2006-09-28 | Hitachi, Ltd. | Method and apparatus for monitoring the quantity of differential data in a storage system |
| US20070027867A1 (en) * | 2005-07-28 | 2007-02-01 | Nec Corporation | Pattern matching apparatus and method |
| US7472242B1 (en) * | 2006-02-14 | 2008-12-30 | Network Appliance, Inc. | Eliminating duplicate blocks during backup writes |
| US20090171990A1 (en) * | 2007-12-28 | 2009-07-02 | Naef Iii Frederick E | Apparatus and methods of identifying potentially similar content for data reduction |
| US20100125553A1 (en) * | 2008-11-14 | 2010-05-20 | Data Domain, Inc. | Delta compression after identity deduplication |
| US20100281208A1 (en) * | 2009-04-30 | 2010-11-04 | Qing Yang | System and Method for Data Storage |
| US20120137059A1 (en) * | 2009-04-30 | 2012-05-31 | Velobit, Inc. | Content locality-based caching in a data storage system |
| US20130243190A1 (en) * | 2009-04-30 | 2013-09-19 | Velobit, Inc. | Optimizing signature computation and sampling for fast adaptive similarity detection based on algorithm-specific performance |
| US20150010143A1 (en) * | 2009-04-30 | 2015-01-08 | HGST Netherlands B.V. | Systems and methods for signature computation in a content locality based cache |
| US20100318759A1 (en) * | 2009-06-15 | 2010-12-16 | Microsoft Corporation | Distributed rdc chunk store |
| US8972672B1 (en) * | 2012-06-13 | 2015-03-03 | Emc Corporation | Method for cleaning a delta storage system |
| US9141301B1 (en) * | 2012-06-13 | 2015-09-22 | Emc Corporation | Method for cleaning a delta storage system |
| US9400610B1 (en) * | 2012-06-13 | 2016-07-26 | Emc Corporation | Method for cleaning a delta storage system |
| US20140189279A1 (en) * | 2013-01-02 | 2014-07-03 | Man Keun Seo | Method of compressing data and device for performing the same |
| US20160328154A1 (en) * | 2014-02-26 | 2016-11-10 | Hitachi, Ltd. | Storage device, apparatus having storage device, and storage control method |
| US20160182088A1 (en) * | 2014-12-19 | 2016-06-23 | Aalborg Universitet | Method For File Updating And Version Control For Linear Erasure Coded And Network Coded Storage |
| US20160335024A1 (en) * | 2015-05-15 | 2016-11-17 | ScaleFlux | Assisting data deduplication through in-memory computation |
| US20170017650A1 (en) * | 2015-07-16 | 2017-01-19 | Quantum Metric, LLC | Document capture using client-based delta encoding with server |
Cited By (76)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20230367678A1 (en) * | 2009-05-22 | 2023-11-16 | Commvault Systems, Inc. | Block-level single instancing |
| US20220382643A1 (en) * | 2009-05-22 | 2022-12-01 | Commvault Systems, Inc. | Block-level single instancing |
| US11709739B2 (en) * | 2009-05-22 | 2023-07-25 | Commvault Systems, Inc. | Block-level single instancing |
| US10282127B2 (en) | 2017-04-20 | 2019-05-07 | Western Digital Technologies, Inc. | Managing data in a storage system |
| US20190095469A1 (en) * | 2017-09-28 | 2019-03-28 | Intel Corporation | Multiple order delta compression |
| US11442910B2 (en) * | 2017-09-28 | 2022-09-13 | Intel Corporation | Multiple order delta compression |
| US11537563B2 (en) | 2017-10-04 | 2022-12-27 | Pure Storage, Inc. | Determining content-dependent deltas between data sectors |
| US12242425B2 (en) * | 2017-10-04 | 2025-03-04 | Pure Storage, Inc. | Similarity data for reduced data usage |
| US10789211B1 (en) * | 2017-10-04 | 2020-09-29 | Pure Storage, Inc. | Feature-based deduplication |
| US20230088163A1 (en) * | 2017-10-04 | 2023-03-23 | Pure Storage, Inc. | Similarity data for reduced data usage |
| US10904009B2 (en) | 2018-05-30 | 2021-01-26 | International Business Machines Corporation | Blockchain implementing delta storage |
| US11604788B2 (en) | 2019-01-24 | 2023-03-14 | EMC IP Holding Company LLC | Storing a non-ordered associative array of pairs using an append-only storage medium |
| US11797204B2 (en) * | 2019-06-17 | 2023-10-24 | Huawei Technologies Co., Ltd. | Data compression processing method and apparatus, and computer-readable storage medium |
| US20210397350A1 (en) * | 2019-06-17 | 2021-12-23 | Huawei Technologies Co., Ltd. | Data Processing Method and Apparatus, and Computer-Readable Storage Medium |
| US11550480B2 (en) * | 2019-07-08 | 2023-01-10 | Continental Teves Ag & Co. Ohg | Method of identifying errors in or manipulations of data or software stored in a device |
| US20220147255A1 (en) * | 2019-07-22 | 2022-05-12 | Huawei Technologies Co., Ltd. | Method and apparatus for compressing data of storage system, device, and readable storage medium |
| US12073102B2 (en) * | 2019-07-22 | 2024-08-27 | Huawei Technologies Co., Ltd. | Method and apparatus for compressing data of storage system, device, and readable storage medium |
| US20230333764A1 (en) * | 2019-07-22 | 2023-10-19 | Huawei Technologies Co., Ltd. | Method and apparatus for compressing data of storage system, device, and readable storage medium |
| US20220300180A1 (en) * | 2019-07-26 | 2022-09-22 | Huawei Technologies Co., Ltd. | Data Deduplication Method and Apparatus, and Computer Program Product |
| US12242747B2 (en) * | 2019-07-26 | 2025-03-04 | Huawei Technologies Co., Ltd. | Data deduplication method and apparatus, and computer program product |
| US12019890B2 (en) * | 2019-07-26 | 2024-06-25 | Huawei Technologies Co., Ltd. | Adjustable deduplication method, apparatus, and computer program product |
| US20220147256A1 (en) * | 2019-07-26 | 2022-05-12 | Huawei Technologies Co., Ltd. | Data Deduplication Method and Apparatus, and Computer Program Product |
| US11327741B2 (en) * | 2019-07-31 | 2022-05-10 | Sony Interactive Entertainment Inc. | Information processing apparatus |
| US20220253222A1 (en) * | 2019-11-01 | 2022-08-11 | Huawei Technologies Co., Ltd. | Data reduction method, apparatus, computing device, and storage medium |
| US12079472B2 (en) * | 2019-11-01 | 2024-09-03 | Huawei Technologies Co., Ltd. | Data reduction method, apparatus, computing device, and storage medium for forming index information based on fingerprints |
| US12093190B2 (en) * | 2019-11-08 | 2024-09-17 | Nec Corporation | Recordation of data in accordance with data compression method and counting reading of the data in accordance with data counting method |
| US11755207B2 (en) | 2019-12-18 | 2023-09-12 | Huawei Technologies Co., Ltd. | Data storage method in storage system and related device |
| US11182359B2 (en) | 2020-01-10 | 2021-11-23 | International Business Machines Corporation | Data deduplication in data platforms |
| US11586598B2 (en) | 2020-01-10 | 2023-02-21 | International Business Machines Corporation | Data deduplication in data platforms |
| US11334268B2 (en) | 2020-01-10 | 2022-05-17 | International Business Machines Corporation | Data lineage and data provenance enhancement |
| US11599546B2 (en) | 2020-05-01 | 2023-03-07 | EMC IP Holding Company LLC | Stream browser for data streams |
| US11604759B2 (en) | 2020-05-01 | 2023-03-14 | EMC IP Holding Company LLC | Retention management for data streams |
| US11960441B2 (en) | 2020-05-01 | 2024-04-16 | EMC IP Holding Company LLC | Retention management for data streams |
| US12056380B2 (en) * | 2020-07-02 | 2024-08-06 | Intel Corporation | Methods and apparatus to deduplicate duplicate memory in a cloud computing environment |
| US20230280922A1 (en) * | 2020-07-02 | 2023-09-07 | Intel Corporation | Methods and apparatus to deduplicate duplicate memory in a cloud computing environment |
| US12271591B2 (en) * | 2020-07-09 | 2025-04-08 | Huawei Technologies Co., Ltd. | Data reduction method and apparatus |
| US11599420B2 (en) | 2020-07-30 | 2023-03-07 | EMC IP Holding Company LLC | Ordered event stream event retention |
| US11762715B2 (en) | 2020-09-30 | 2023-09-19 | EMC IP Holding Company LLC | Employing triggered retention in an ordered event stream storage system |
| US11755555B2 (en) | 2020-10-06 | 2023-09-12 | EMC IP Holding Company LLC | Storing an ordered associative array of pairs using an append-only storage medium |
| US11599293B2 (en) | 2020-10-14 | 2023-03-07 | EMC IP Holding Company LLC | Consistent data stream replication and reconstruction in a streaming data storage platform |
| US20220129184A1 (en) * | 2020-10-26 | 2022-04-28 | EMC IP Holding Company LLC | Data deduplication (dedup) management |
| US11698744B2 (en) * | 2020-10-26 | 2023-07-11 | EMC IP Holding Company LLC | Data deduplication (dedup) management |
| WO2022139626A1 (en) * | 2020-12-22 | 2022-06-30 | Huawei Technologies Co., Ltd. | Method for storing a data page in a data storage device using similarity based data reduction |
| US20220197527A1 (en) * | 2020-12-23 | 2022-06-23 | Hitachi, Ltd. | Storage system and method of data amount reduction in storage system |
| US11816065B2 (en) | 2021-01-11 | 2023-11-14 | EMC IP Holding Company LLC | Event level retention management for data streams |
| US12099513B2 (en) | 2021-01-19 | 2024-09-24 | EMC IP Holding Company LLC | Ordered event stream event annulment in an ordered event stream storage system |
| US11615063B2 (en) | 2021-01-22 | 2023-03-28 | EMC IP Holding Company LLC | Similarity deduplication |
| WO2022159127A1 (en) * | 2021-01-22 | 2022-07-28 | EMC IP Holding Company LLC | Similarity deduplication |
| US20230418497A1 (en) * | 2021-03-09 | 2023-12-28 | Huawei Technologies Co., Ltd. | Memory Controller and Method for Improved Deduplication |
| US12307112B2 (en) * | 2021-03-09 | 2025-05-20 | Huawei Technologies Co., Ltd. | Memory controller and method for detecting data similarity for deduplication in a data storage system |
| US11740828B2 (en) * | 2021-04-06 | 2023-08-29 | EMC IP Holding Company LLC | Data expiration for stream storages |
| US20220317915A1 (en) * | 2021-04-06 | 2022-10-06 | EMC IP Holding Company LLC | Data expiration for stream storages |
| US12001881B2 (en) | 2021-04-12 | 2024-06-04 | EMC IP Holding Company LLC | Event prioritization for an ordered event stream |
| US11954537B2 (en) | 2021-04-22 | 2024-04-09 | EMC IP Holding Company LLC | Information-unit based scaling of an ordered event stream |
| US20220342574A1 (en) * | 2021-04-23 | 2022-10-27 | EMC IP Holding Company LLC | Extending similarity-based deduplication to adjacent data |
| US11748015B2 (en) * | 2021-04-23 | 2023-09-05 | EMC IP Holding Company LLC | Extending similarity-based deduplication to adjacent data |
| US11681460B2 (en) | 2021-06-03 | 2023-06-20 | EMC IP Holding Company LLC | Scaling of an ordered event stream based on a writer group characteristic |
| US11735282B2 (en) | 2021-07-22 | 2023-08-22 | EMC IP Holding Company LLC | Test data verification for an ordered event stream storage system |
| US20230089018A1 (en) * | 2021-09-23 | 2023-03-23 | EMC IP Holding Company LLC | Method or apparatus to integrate physical file verification and garbage collection (gc) by tracking special segments |
| US11847334B2 (en) * | 2021-09-23 | 2023-12-19 | EMC IP Holding Company LLC | Method or apparatus to integrate physical file verification and garbage collection (GC) by tracking special segments |
| US11748307B2 (en) * | 2021-10-13 | 2023-09-05 | EMC IP Holding Company LLC | Selective data compression based on data similarity |
| US11971850B2 (en) | 2021-10-15 | 2024-04-30 | EMC IP Holding Company LLC | Demoted data retention via a tiered ordered event stream data storage system |
| US20240311013A1 (en) * | 2021-11-25 | 2024-09-19 | Huawei Technologies Co., Ltd. | Data storage system, intelligent network interface card, and compute node |
| US12124727B2 (en) * | 2021-12-17 | 2024-10-22 | Samsung Electronics Co., Ltd. | Automatic deletion in a persistent storage device |
| US20230195351A1 (en) * | 2021-12-17 | 2023-06-22 | Samsung Electronics Co., Ltd. | Automatic deletion in a persistent storage device |
| CN114415955A (en) * | 2022-01-05 | 2022-04-29 | 上海交通大学 | Fingerprint-based block granularity data deduplication system and method |
| US20230221864A1 (en) * | 2022-01-10 | 2023-07-13 | Vmware, Inc. | Efficient inline block-level deduplication using a bloom filter and a small in-memory deduplication hash table |
| US20230229329A1 (en) * | 2022-01-20 | 2023-07-20 | Dell Products L.P. | Time-series data deduplication (dedup) caching |
| US11880577B2 (en) * | 2022-01-20 | 2024-01-23 | Dell Products L.P. | Time-series data deduplication (dedupe) caching |
| US20230367477A1 (en) * | 2022-05-12 | 2023-11-16 | Hitachi, Ltd. | Storage system, data management program, and data management method |
| US11789639B1 (en) * | 2022-07-20 | 2023-10-17 | Zhejiang Lab | Method and apparatus for screening TB-scale incremental data |
| US12007948B1 (en) * | 2022-07-31 | 2024-06-11 | Vast Data Ltd. | Similarity based compression |
| WO2024030040A1 (en) * | 2022-08-02 | 2024-02-08 | Huawei Technologies Co., Ltd | Method for date compression and related device |
| US12282673B2 (en) * | 2023-03-23 | 2025-04-22 | International Business Machines Corporation | Limiting deduplication search domains |
| CN117216022A (en) * | 2023-11-07 | 2023-12-12 | 湖南省致诚工程咨询有限公司 | Digital engineering consultation data management system |
| US20250244900A1 (en) * | 2024-01-26 | 2025-07-31 | Samsung Electronics Co., Ltd. | Method for data deduplication of storage apparatus and storage apparatus |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20170038978A1 (en) | | Delta Compression Engine for Similarity Based Data Deduplication |
| US10778246B2 (en) | | Managing compression and storage of genomic data |
| US12189693B2 (en) | | Method and system for document similarity analysis |
| US9286313B1 (en) | | Efficient lossless reduction of data by deriving data from prime data elements resident in a content-associative sieve |
| US10187081B1 (en) | | Dictionary preload for data compression |
| JP6381546B2 (en) | | Method for pipelined compression of multi-byte frames, apparatus for high bandwidth compression into an encoded data stream, and computer program product |
| US10224957B1 (en) | | Hash-based data matching enhanced with backward matching for data compression |
| US20250047300A1 (en) | | System and method for data processing and transformation using reference data structures |
| US11675768B2 (en) | | Compression/decompression using index correlating uncompressed/compressed content |
| US11868616B2 (en) | | System and method for low-distortion compaction of floating-point numbers |
| CN105844210B (en) | | Hardware efficient fingerprinting |
| US9137336B1 (en) | | Data compression techniques |
| US12436920B2 (en) | | System and method for file type identification using machine learning |
| KR20170040343A (en) | | Adaptive rate compression hash processing device |
| US20130185319A1 (en) | | Compression pattern matching |
| US20250139060A1 (en) | | System and method for intelligent data access and analysis |
| US11748307B2 (en) | | Selective data compression based on data similarity |
| US8988258B2 (en) | | Hardware compression using common portions of data |
| CN105843837B (en) | | Hardware efficient rabin fingerprinting |
| CN114691813A (en) | | Data transmission method and device, electronic equipment and computer readable storage medium |
| US12099475B2 (en) | | System and method for random-access manipulation of compacted data files |
| US20240354342A1 (en) | | Compact Probabilistic Data Structure For Storing Streamed Log Lines |
| CN120238134A (en) | | Data compression method and data decompression method |
| WO2025128604A1 (en) | | Lazy matching algorithm for data compression |
| CN116112122A (en) | | Data compression method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: HGST NETHERLANDS B.V., NETHERLANDS. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: LI, DONGYANG; WANG, QINGBO; BANDIC, ZVONIMIR Z; AND OTHERS; SIGNING DATES FROM 20160720 TO 20160721; REEL/FRAME: 039291/0369 |
| | AS | Assignment | Owner name: WESTERN DIGITAL TECHNOLOGIES, INC., CALIFORNIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: HGST NETHERLANDS B.V.; REEL/FRAME: 040831/0265. Effective date: 20160831 |
| | AS | Assignment | Owner name: WESTERN DIGITAL TECHNOLOGIES, INC., CALIFORNIA. Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE INCORRECT SERIAL NO 15/025,946 PREVIOUSLY RECORDED AT REEL: 040831 FRAME: 0265. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT; ASSIGNOR: HGST NETHERLANDS B.V.; REEL/FRAME: 043973/0762. Effective date: 20160831 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |