WO2023136740A1 - Device and method for similarity detection of compressed data - Google Patents

Device and method for similarity detection of compressed data

Info

Publication number
WO2023136740A1
Authority
WO
WIPO (PCT)
Prior art keywords
data block
incoming
data
incompressible
block
Prior art date
Application number
PCT/RU2022/000006
Other languages
French (fr)
Inventor
Aleksei Valentinovich ROMANOVSKII
Vitaliy Sergeevich KHARIN
Sergey Alexandrovich CHERNOV
Artem Valentinovich KUZMITSKIY
Denis Sergeevich TARAKANOV
Denis Yurievich ARKHIPOV
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/RU2022/000006 priority Critical patent/WO2023136740A1/en
Publication of WO2023136740A1 publication Critical patent/WO2023136740A1/en

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608Saving storage space on storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0679Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3088Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091Data deduplication
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • G06F11/1453Management of the data involved in backup or backup restore using de-duplication of the data

Definitions

  • the present disclosure generally relates to data storage, and more specifically to data reduction in data storage and transmission devices.
  • the disclosure introduces data processing devices and methods for supporting similarity detection for compressed data, thereby improving the efficiency of these devices.
  • Data similarity (resemblance) detection is widely used in data storage, data transmission over a network, plagiarism detection, web search, etc. Similarity detection, when used in data storage and transmission devices, makes it possible to apply deduplication or delta-compression for similar data, thus further improving the efficiency of these devices.
  • chunking is used to divide data into chunks (blocks) of fixed or variable size.
  • when chunking is defined by data content, it is called Content Defined Chunking (CDC).
  • CDC: Content Defined Chunking
  • blocks are tested for similarity by comparing their features, where a feature of a data block is usually a number, e.g., a hash, calculated for the content of the whole data block or a part of the data block.
  • Popular approaches to match features of processed data blocks and incoming data blocks include key-value stores (hash- tables), Bloom filters, and sorting. For instance, the feature of a respective block, and the block, or block ID, can be used to form a key-value pair. Then the key-value store is used to store the key-value pairs of processed blocks and to find blocks with the same keys, as those calculated for incoming data blocks.
  • if a similar data block is found for an incoming data block, that similar data block can be used as a dictionary to delta-compress the incoming data block.
  • in that case, only a small part of the incoming data, the so-called delta or difference, needs to be compressed and placed in data storage or transmitted over the network.
  • if the incoming data block is fully equal to the similar data block, the delta or difference will be zero.
  • in that case, data deduplication, rather than delta-compression, can be applied to the incoming data block.
  • Hashing effectively serves for the deduplication of fully equal data blocks after the data blocks are compressed. For example, in data storage devices, an incoming data block is compressed first, then a hash value is calculated for the compressed data. This hash value is used as a key to look up in a hash table, where the hash table contains key-value pairs for already processed and compressed data blocks. In cases where there may be many duplicates in the incoming data, and/or data compression uses a lot of computations, the hash calculation and deduplication are applied to incoming data before its compression.
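  • As an illustration of this compress-first, then hash-and-look-up flow, the following minimal Python sketch uses zlib and SHA-256 as stand-ins; the disclosure's own compressor is a specialized byte-aligned LZ-class algorithm, and the names dedup_store and process_block are purely illustrative:

```python
import hashlib
import zlib

# Stand-in "first key-value store": hash of compressed bytes -> block ID.
dedup_store = {}

def process_block(block_id, raw_block):
    """Compress first, then deduplicate on the compressed bytes (sketch)."""
    compressed = zlib.compress(raw_block)          # stand-in for the byte-aligned compressor
    key = hashlib.sha256(compressed).hexdigest()   # hash calculated for the compressed data
    if key in dedup_store:
        return ("duplicate_of", dedup_store[key])  # keep only a reference to the earlier block
    dedup_store[key] = block_id
    return ("unique", compressed)                  # unique block: keep the compressed data
```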
  • in that case, the already processed data must also be decompressed to be used as the dictionary for delta-compression of the not yet compressed incoming data.
  • the conventional partial similarity detection method, which uses chunking and hashing before compression of incoming data, is efficient in terms of compression ratio, but considerably degrades the latency and throughput of data storage and transmission devices, since it requires decompression of already processed data both for similarity detection and for the following delta-compression.
  • An objective of this disclosure is thus to introduce a special compression algorithm that enables partial similarity detection for compressed data. Another objective is to combine deduplication with partial similarity detection and delta-compression, i.e., all three approaches applied to incoming data after its compression by the special compression algorithm. Another objective is to delay the partial similarity detection and delta-compression, thereby sparing computations in scenarios with a high churn of incoming data, where data is discarded (deleted) soon after it is sent to storage or transmitted.
  • a first aspect of the disclosure provides a data processing device being configured to: process a sequence of incoming data blocks to obtain a sequence of processed data blocks by using a compression algorithm that produces byte-aligned output, wherein each processed data block of the sequence is a compressed data block or an incompressible data block; accumulate the sequence of processed data blocks into an intermediate data structure; perform a similarity detection on each data block in the sequence of processed data blocks that is accumulated in the intermediate data structure, to detect whether one or more similar data blocks have been stored in or received by the data processing device; and if a similar data block is found, depending on a similarity degree perform a delta-compression or deduplication on a data block of the sequence of processed data blocks that corresponds to the similar data block.
  • This disclosure accordingly proposes a data processing approach that relies on specialized compression algorithms that can produce byte-aligned output for the following similarity detection and delta-compression or deduplication.
  • the compressor outputs data aligned on a byte boundary.
  • the intermediate data structure may be temporary storage such as a batch, or a queue, etc. It should be noted that this disclosure enables the similarity detection and also the following delta-compression or deduplication of the compressed data - without decompression, thereby eliminating the need to decompress both already stored or transmitted data and already compressed incoming data.
  • performing a delta-compression or deduplication on a data block comprises: performing the delta-compression on the data block, if the found similar data block is partially similar to that data block; and performing deduplication on the data block, if the found similar data block is fully equal to that data block.
  • Partially similar means that at least a part of the data block is identical to the found similar data block.
  • this disclosure proposes to delta-compress already compressed data.
  • the data block is identical to the found similar data block.
  • the identical data block may be deduplicated, for instance, it may be replaced by a reference to the found similar (identical) data block.
  • the sequence of incoming data blocks comprises at least one of one or more incoming compressible data blocks and one or more incoming incompressible data blocks.
  • the compression algorithm is a specialized similarity detection aware compression algorithm
  • processing the sequence of incoming data blocks comprises: compressing the one or more incoming compressible data blocks using the specialized similarity detection aware compression algorithm to obtain one or more incoming compressed data blocks and metadata associated with each of the one or more incoming compressed data blocks, and/or calculating metadata associated with each of the one or more incoming incompressible data blocks using the specialized similarity detection aware compression algorithm
  • the sequence of processed data blocks comprises at least one of the one or more incoming compressed data blocks and one or more incoming incompressible data blocks
  • metadata associated with each block of the sequence of the processed data blocks comprises the metadata associated with each incoming compressed data block, or the metadata associated with each incoming incompressible data block.
  • this disclosure proposes to process each incoming data block using a specialized compression algorithm that is aware of the following similarity detection and possible delta-compression.
  • the proposed specialized compression algorithm produces compressed data in a byte-aligned format and extra metadata.
  • the proposed specialized compression algorithm produces the extra metadata.
  • the data processing device is further configured to: accumulate the sequence of processed data blocks and the metadata associated with each block of the sequence of the processed data blocks into the intermediate data structure; and perform the similarity detection based on the metadata associated with each block of the sequence of the processed data blocks.
  • processing the sequence of incoming data blocks further comprises: detecting whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks is a duplicate of a previously processed data block; and replacing the detected duplicate by reference to the previously processed data block.
  • this disclosure proposes to combine deduplication with partial similarity detection and delta-compression. All three procedures are applied to incoming data after its compression by the specialized compression algorithm.
  • the data processing device is further configured to: maintain a first key-value store keeping information of previously processed data blocks, wherein the information of each previously processed data block comprises a key and a value associated with that previously processed data block, wherein the value comprises an identifier of that previously processed data block and at least a part of the metadata of the processed data block; and detect whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks is a duplicate by using the first key-value store.
  • This disclosure further proposes a particular key-value store (hash table), i.e., the first key-value store, used for finding possible duplicates for incoming compressed data blocks among already processed compressed data blocks.
  • the data processing device is further configured to: update the first key-value store with information of the incoming compressed data block or the incoming incompressible data block, when no duplicates are detected, by adding a key and a value associated with that incoming compressed data block or incoming incompressible data block to the first key-value store.
  • the metadata associated with each of the processed data blocks in the sequence comprises one or more of the following:
  • the one or more indications comprised in the metadata will be used for the following similarity detection.
  • the data processing device is further configured to: set a delay after accumulating the sequence of processed data blocks; and perform the similarity detection on the intermediate data structure after the delay expires.
  • data may be discarded (deleted) soon after it is sent to the storage or transmitted.
  • the delay may decrease the number of blocks passed to the following similarity detection procedure.
  • the data processing device is further configured to: update the intermediate data structure if a processed data block of the sequence is discarded or deleted before the delay expires.
  • the data processing device when the sequence of processed data blocks comprises the one or more incoming compressed data blocks, is further configured to: sample each incoming compressed data block of the sequence to obtain a plurality of samples for that incoming compressed data block; calculate a hash for each of the plurality of samples using a hash function; and generate a combined key for each of the plurality of samples by combining the calculated hash for that sample with one of the indications comprised in the metadata associated with incoming compressed data block.
  • a taken sample comprises a number of sequential bytes of the compressed data block, not the bytes of incoming data block before compression.
  • a hash value is calculated.
  • the hash value calculated for a sample may be combined with one of the indications comprised in the metadata.
  • Such a combined key may be used for the following similarity detection.
  • the data processing device is further configured to: maintain a second key-value store keeping information about at least a part of previously obtained samples from a previously compressed data block, wherein the information of each previously obtained sample comprises at least one of: a combined key associated with that previously obtained sample, an identifier of the previously compressed data block that the previously obtained sample belongs to, one or more repeating offsets, and one or more hashes with minimal values associated with that previously compressed data block; and wherein performing the similarity detection on each compressed block in the sequence of processed data blocks comprises determining whether a compressed data block of the sequence has a partial similarity with, or is fully equal to, a previously compressed data block based on the information stored in the second key-value store.
  • This disclosure further proposes another key-value store for (partial and full equality) similarity detection in compressed data, which is different from the first key-value store for deduplication of compressed data.
  • each of the first key-value store and the second key-value store may use different hash functions for respective key (hash) calculations.
  • the data processing device is further configured to: determine a most similar previously compressed data block for a particular incoming compressed data block based on the information stored in the second key-value store; and if the most similar previously compressed data block has a partial similarity with the particular incoming compressed data block, perform the delta-compression on that particular compressed data block using the most similar previously compressed data block as a dictionary without decompressing the most similar previously compressed data block and the particular incoming compressed data block; if the most similar previously compressed data block is fully equal to the incoming compressed data block, perform deduplication of the particular incoming compressed data block.
  • This disclosure proposes to delta-compress already compressed data, thereby eliminating the need to decompress both dictionaries (already stored or transmitted) and already compressed incoming data.
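  • The sketch below illustrates the idea of dictionary-based delta-compression of already compressed blocks using zlib's preset-dictionary mode; zlib here is only a stand-in for the disclosure's own delta-compressor, and the point is that neither the incoming compressed block nor the similar block used as the dictionary is decompressed:

```python
import zlib

def delta_compress(incoming_compressed, similar_compressed):
    # The most similar previously compressed block serves as a preset dictionary;
    # both arguments are already-compressed byte strings and stay compressed.
    co = zlib.compressobj(zdict=similar_compressed)
    return co.compress(incoming_compressed) + co.flush()

def delta_decompress(delta, similar_compressed):
    do = zlib.decompressobj(zdict=similar_compressed)
    return do.decompress(delta) + do.flush()
```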
  • the data processing device is further configured to: update the second key-value store with information of obtained samples from a particular incoming compressed data block, when no similar previously compressed data block is found for that particular incoming compressed data block.
  • the second key-value store is updated with information about the unique incoming compressed block, in particular with its samples and respective extra metadata.
  • each incoming compressed data block comprises incompressible data and compressed data
  • the data processing device is further configured to: sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the beginning of the incoming compressed block is not below a first threshold, and/or when a size of the incoming compressed data block is not below a second threshold; or sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the end of the incoming compressed block is not below a threshold and/or when a size of the incoming compressed data block is not below a second threshold; or sample the incoming compressed data block inside of the compressed data without decompressing it, when one or more repeating offsets are encoded by the specialized similarity detection aware compression algorithm; and/or when a size of the incoming compressed data block is not below a second threshold.
  • the plurality of samples for each incoming compressed data block comprises one or more of the following:
  • the data processing device when the sequence of processed data blocks comprises the one or more incoming incompressible data blocks, is further configured to: scan each incoming incompressible data block by the specialized similarity detection aware compression algorithm, to obtain a plurality of hashes for the incoming incompressible data block; determine one or more hashes with minimal values for the incoming incompressible data block among the plurality of hashes.
  • the data processing device is further configured to: maintain a third key-value store keeping information about at least a part of previously scanned incoming incompressible data blocks, wherein the information of each previously scanned incoming incompressible data block comprises at least one of: one or more hashes with minimal values, and an identifier of the previously scanned incoming incompressible data block; and wherein performing the similarity detection on each incompressible data block in the sequence of processed data blocks in the intermediate data structure comprises determining whether an incompressible data block of the sequence has a partial similarity with, or is fully equal to, a previously scanned incompressible data block based on the information stored in the third key-value store.
  • This disclosure further proposes a third key-value store used for similarity detection among non-compressible blocks. It should be noted that non-compressible blocks will not be sampled. The one or more hashes with minimal values (the lowest-value hashes found by the LZ-class compressor and output as extra metadata), instead of samples, will be used as keys for lookups and updates of the third key-value store.
  • the data processing device is further configured to: determine a most similar previously scanned incompressible data block for a particular incoming incompressible data block based on the information stored in the third key-value store; and if the most similar previously scanned incompressible data block has a partial similarity with the particular incoming incompressible data block, perform the delta-compression on that particular incoming incompressible data block using the most similar previously scanned incompressible data block as a dictionary; if the most similar previously scanned incompressible data block is fully equal to the incoming incompressible block, perform deduplication of that particular incoming incompressible data block.
  • this disclosure proposes to delta-compress an incoming incompressible data block for which a similar incompressible data block has been found, and to deduplicate an incoming incompressible data block that is identical to a previously stored incompressible data block.
  • the data processing device is further configured to: update the third key-value store with information about a particular incoming incompressible data block, when no similar previously scanned incompressible data block is found for that particular incoming incompressible data block.
  • the third key-value store is updated with information about the unique incoming incompressible data block, in particular with its hashes with minimal values and respective extra metadata.
  • a second aspect of the disclosure provides a method performed by a data processing device for data processing, comprising: processing a sequence of incoming data blocks to obtain a sequence of processed data blocks by using a compression algorithm that produces byte-aligned output, wherein each processed data block of the sequence is a compressed data block or an incompressible data block; accumulating the sequence of processed data blocks into an intermediate data structure; performing a similarity detection on each data block in the sequence of processed data blocks that is accumulated in the intermediate data structure, to detect whether one or more similar data blocks have been stored in or received by the data processing device; and if a similar data block is found, depending on a similarity degree performing a delta-compression or deduplication on a data block in the sequence of processed data blocks that corresponds to the similar data block.
  • Implementation forms of the method of the second aspect may correspond to the implementation forms of the data processing device of the first aspect described above.
  • the method of the second aspect and its implementation forms achieve the same advantages and effects as described above for the data processing device of the first aspect and its implementation forms.
  • a third aspect of the disclosure provides a computer program or computer program product comprising a program code for carrying out, when implemented on a processor, the method according to the second aspect and any implementation forms of the second aspect.
  • a fourth aspect of the disclosure provides a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method according to the second aspect and any implementation forms of the second aspect.
  • FIG. 1 shows a data processing device according to an embodiment of the disclosure
  • FIG. 2 shows compressed data according to an embodiment of the disclosure
  • FIG. 3 shows compressed data according to an embodiment of the disclosure
  • FIG. 4 shows compressed data according to an embodiment of the disclosure
  • FIG. 5 shows a hash table according to an embodiment of the disclosure
  • FIG. 6 shows a hash table according to an embodiment of the disclosure
  • FIG. 7 shows a flow chart according to an embodiment of the disclosure
  • FIG. 8 shows a similarity detection procedure according to an embodiment of the disclosure
  • FIG. 9 shows a method according to an embodiment of the disclosure.
  • an embodiment or example may refer to other embodiments or examples.
  • any description, including but not limited to terminology, element, process, explanation, and/or technical advantage mentioned in one embodiment/example, is applicable to the other embodiments or examples.
  • FIG. 1 shows a data processing device 100 according to an embodiment of the disclosure.
  • the data processing device 100 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the data processing device 100 described herein.
  • the processing circuitry may comprise hardware and software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • the data processing device 100 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under the control of the software.
  • the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the data processing device 100 to be performed.
  • the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors.
  • the non-transitory memory may carry executable program code which, when executed by the one or more processors, causes data processing device 100 to perform, conduct or initiate the operations or methods described herein.
  • the data processing device 100 is configured to process a sequence of incoming data blocks 101 to obtain a sequence of processed data blocks 102 by using a compression algorithm that produces byte-aligned output.
  • each processed data block of the sequence is a compressed data block or an incompressible data block.
  • the data processing device 100 is further configured to accumulate the sequence of processed data blocks 102 into an intermediate data structure 103.
  • the intermediate data structure 103 may be temporary storage such as a batch, or a queue, etc.
  • the data processing device 100 is configured to perform a similarity detection on each data block in the sequence of processed data blocks 102 that is accumulated in the intermediate data structure 103, to detect whether one or more similar data blocks have been stored in or received by the data processing device 100. If a similar data block is found, depending on a similarity degree, the data processing device 100 is further configured to perform a delta-compression or deduplication on a data block in the sequence of processed data blocks 102 that corresponds to the similar data block.
  • This disclosure proposes a data processing approach that relies on specialized compression algorithms that can produce byte-aligned output for the following similarity detection and delta-compression or deduplication.
  • the compressor outputs data aligned on a byte boundary.
  • whether the data processing device 100 performs a delta-compression or deduplication on a data block depends on a result of the similarity detection, such as a similarity degree if a similar data block is found.
  • the data processing device 100 is configured to perform the delta-compression on the data block, if the found similar data block is partially similar to that data block. “Partially similar” means that at least a part of the data block is identical to the found similar data block.
  • the data processing device 100 is also configured to perform deduplication on the data block, if the found similar data block is fully equal to that data block. In the case that the found similar data block is fully equal to a particular data block, that particular data block is identical to the found similar data block.
  • the sequence of incoming data blocks 101 may comprise at least one of one or more incoming compressible data blocks and one or more incoming incompressible data blocks. That is, each incoming data block may be a compressible data block or an incompressible data block.
  • the compression algorithm may be a specialized similarity detection aware compression algorithm.
  • processing the sequence of incoming data blocks 101 may comprise compressing the one or more incoming compressible data blocks using the specialized similarity detection aware compression algorithm to obtain one or more incoming compressed data blocks and metadata associated with each of the one or more incoming compressed data blocks.
  • processing the sequence of incoming data blocks 101 may comprise calculating metadata associated with each of the one or more incoming incompressible data blocks using the specialized similarity detection aware compression algorithm.
  • the sequence of processed data blocks 102 comprises at least one of the one or more incoming compressed data blocks and one or more incoming incompressible data blocks
  • metadata associated with each block of the sequence of the processed data blocks comprises the metadata associated with each incoming compressed data block, or the metadata associated with each incoming incompressible data block.
  • This disclosure proposes to process each incoming data block into a byte-aligned compressed format, in particular using a specialized LZ-class compression algorithm that is aware of the following similarity detection (SD) and possible delta-compression (DC).
  • SD: similarity detection
  • DC: delta-compression
  • SDDC-aware compression may use dictionary matching such as in LZ77, does not combine it with entropy coding, and produces compressed data in a byte-aligned format.
  • the resulting compressed data is a finite ordered sequence of encoded elements.
  • FIG. 2 shows an example of compressed data after the SDDC-aware compression according to an embodiment of the disclosure. It is named “LZ compressed data” in the figure.
  • Each element in the sequence, referred to as "lz_tuple" in this example, may start with a 24-bit tag, further referred to as "lz_tag".
  • the lz_tag may comprise or encode the following entities:
  • The other 12 bits of the lz_tag are used to encode the lengths of repeats and non-repeats, as described below.
  • the number of bits to encode non-repeat length in lz_tag is further referred to as “non-repeat size” or “lz_non_repeat_bits”.
  • the lz_tag may be followed by extended lengths and then by unique bytes, as in the open-source LZ4 algorithm. There may be encoded elements with 0 unique bytes. It should be noted that the first element of the compressed sequence always encodes at least one unique byte. The last element of the compressed sequence may or may not encode unique bytes.
  • the input data is: “ABRACADABRA”.
  • the compression algorithm detects repeating sub-string “ABRA” in the input string “ABRACADABRA”. Therefore, from the compression algorithm point of view, the input string comprises a non-repeat sub-string (unique sub-string of 7 letters) “ABRACAD”, followed by a repeat sub-string “ABRA” (duplicate of the first 4 letters).
  • the offset of the repeat “ABRA” in the input string is 0.
  • The encoded data comprises a 24-bit lz_tag followed by the non-repeat "ABRACAD".
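  • A toy sketch of how the single lz_tuple for "ABRACADABRA" could be laid out in byte-aligned form follows; the exact bit allocation inside the 24-bit lz_tag (here: 12 bits offset, 6 bits non-repeat length, 6 bits repeat length) is an assumption made only for illustration, not a layout defined by the disclosure:

```python
def encode_lz_tuple(offset, non_repeat, repeat_len):
    """Pack one lz_tuple: an assumed 24-bit lz_tag followed by the unique bytes."""
    tag = (offset << 12) | (len(non_repeat) << 6) | repeat_len   # assumed bit layout
    return tag.to_bytes(3, "big") + non_repeat                   # byte-aligned output

# "ABRACADABRA" = non-repeat "ABRACAD" (7 bytes) + repeat "ABRA" (length 4, offset 0).
lz_compressed = encode_lz_tuple(offset=0, non_repeat=b"ABRACAD", repeat_len=4)
assert len(lz_compressed) == 3 + 7   # 3-byte tag + 7 unique bytes, aligned on a byte boundary
```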
  • This disclosure also proposes to collect extra metadata for each incoming data block during the compression procedure, using the SDDC-aware compression algorithm.
  • the extra metadata is not a part of the compressed data, and the extra metadata is not needed for decompression.
  • the extra metadata is used for facilitating the following partial similarity detection and possible delta compression in the compressed data. It may be worth mentioning that according to this disclosure, the already processed compressed data does not need to be decompressed for the partial similarity comparison and the following delta-compression.
  • the extra metadata that is produced by the SDDC-aware compression is discussed as follows.
  • the metadata associated with each of the processed data blocks in the sequence comprises one or more of the following:
  • the position where the incompressible data starts in the processed data may refer to, for example, a position N bytes back from the last byte of the compressed data block.
  • hash values produced by the compression algorithm may include: 212, 145, 98, 666, 32567, and 178. Among these values, there are 2 hashes with minimal values: 98 (absolute min), and 145 (next min after absolute one). According to this disclosure, two or more numbers of hashes with minimal values may be indicated by the extra metadata.
  • the metadata associated with each of the processed data blocks in the sequence may also comprise an indication about one or more most popular hash values, which are calculated during compression.
  • the most popular hash value is the hash value that repeats most frequently during hash calculation.
  • when an LZ-compressor takes an input data block, it scans the input data sequentially to find repeating "words".
  • usually, a "word" is 2, 3, or 4 bytes.
  • LZ-scanning is done by incrementing some index/pointer into the input data block and taking the next word. Assuming the input is "ABRACADABRA" and the word size is 3 bytes, the LZ-compressor takes the following words for analysis: ABR, BRA, RAC, ACA, CAD, etc.
  • the LZ-compressor usually performs a hash calculation for each scanned word, then uses some sort of hash table, binary tree, etc. to look up this hash and to find candidate words for a possible match (word equality).
  • a simple hash function may be used in LZ-class algorithms, therefore many hash collisions may happen.
  • a hash collision happens when the same hash value, e.g., 2347, is calculated for different words, e.g., ABR and CAD.
  • a hash collision implies a repeating hash. Hashes that repeat (collide) most frequently are output by the compressor as extra metadata. It should be noted that hash collisions (hash repeating) happen for any data, either compressible or non-compressible.
  • alternatively, hashes with the lowest values, or with the highest values, may be considered instead of more frequent and less frequent hashes.
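  • The following sketch mimics the LZ-style word scan described above and collects both the most frequently repeating hashes and the hashes with minimal values as extra metadata; the 16-bit CRC-based hash, the word size of 3 bytes, and keeping two hashes of each kind are illustrative choices, not values fixed by the disclosure:

```python
from collections import Counter
import heapq
import zlib

WORD_SIZE = 3   # as in the "ABRACADABRA" example above

def scan_block(block):
    """Hash every overlapping word and derive popular_hashes and minimal hashes (sketch)."""
    hashes = [zlib.crc32(block[i:i + WORD_SIZE]) & 0xFFFF       # simple, collision-prone hash
              for i in range(len(block) - WORD_SIZE + 1)]
    popular_hashes = [h for h, _ in Counter(hashes).most_common(2)]  # repeat most frequently
    min_hashes = heapq.nsmallest(2, set(hashes))                     # absolute min and next min
    return popular_hashes, min_hashes

popular_hashes, min_hashes = scan_block(b"ABRACADABRA")
```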
  • the extra metadata may comprise, but is not limited to:
  • this bit or these bits may be further referred to as the "head_bits" of the incoming block.
  • this bit or these bits may be further referred to as the "tail_bits" of the incoming block.
  • these hash values are further referred to as the popular_hashes of the incoming block.
  • the extra metadata for the example string “ABRACADABRA” may include:
  • head_bits: 1 bit for the first (and only) element of the compressed sequence; in this case, the head_bits has value 0;
  • tail_bits: 1 bit for the last element; in this case, the tail_bits has value 0;
  • the data processing device may be further configured to accumulate the sequence of processed data blocks 102 and the metadata associated with each block of the sequence of the processed data blocks into the intermediate data structure 103. Then the data processing device may be further configured to perform the similarity detection based on the metadata associated with each block of the sequence of the processed data blocks.
  • processing the sequence of incoming data blocks 101 further comprises detecting whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks 102 is a duplicate of a previously processed data block, and then replacing the detected duplicate by reference to the previously processed data block.
  • This disclosure further proposes a key-value store structure (or a hash-table) used for finding possible duplicates for incoming compressed data blocks among already processed compressed data blocks.
  • the data processing device 100 may be further configured to maintain a first keyvalue store keeping information of previously processed data blocks.
  • the information of each previously processed data block comprises a key and a value associated with that previously processed data block.
  • the value comprises an identifier of that previously processed data block and at least a part of the metadata of the processed data block.
  • the data processing device 100 may be further configured to detect whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks 102 is a duplicate by using the first key-value store.
  • the data processing device 100 may be further configured to update the first keyvalue store with information of the incoming compressed data block or the incoming incompressible data block, when no duplicates are detected, by adding a key and a value associated with that incoming compressed data block or incoming incompressible data block to the first key-value store.
  • This disclosure further proposes that the deduplicated incoming compressed data blocks, together with respective extra metadata, which are accumulated in the intermediate data structure 103, will be passed with a delay to the following similarity detection procedure.
  • the data processing device 100 may be further configured to set a delay after accumulating the sequence of processed data blocks. Then, the data processing device 100 may be configured to perform the similarity detection on the intermediate data structure 103 after the delay expires.
  • data may be discarded (deleted) soon after it is sent to storage or transmitted.
  • the delay may decrease the number of blocks passed to the following similarity detection procedure.
  • the data processing device 100 may be further configured to update the intermediate data structure 103 if a processed data block of the sequence is discarded or deleted before the delay expires.
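  • A minimal sketch of the intermediate data structure follows: processed blocks and their extra metadata are accumulated, entries can be dropped if a block is discarded before the delay expires, and the surviving entries are drained to similarity detection afterwards; the class and method names are illustrative only:

```python
from collections import OrderedDict

class IntermediateBatch:
    """Illustrative batch/queue for processed blocks awaiting similarity detection."""

    def __init__(self):
        self._entries = OrderedDict()          # block_id -> (compressed_bytes, metadata)

    def accumulate(self, block_id, compressed, metadata):
        self._entries[block_id] = (compressed, metadata)

    def discard(self, block_id):
        # Block was deleted before the delay expired (high-churn scenario).
        self._entries.pop(block_id, None)

    def drain(self):
        # Delay expired: hand the surviving blocks over to similarity detection.
        entries, self._entries = self._entries, OrderedDict()
        return list(entries.items())
```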
  • the data processing device when the sequence of processed data blocks 102 comprises the one or more incoming compressed data blocks, is further configured to sample each incoming compressed data block of the sequence to obtain a plurality of samples for that incoming compressed data block. For each incoming compressed data block, without its decompression, several fixed-size samples may be taken to determine its possible similarity with already processed compressed data blocks. Notably, a taken sample comprises a number of sequential bytes of the compressed data block, not the bytes of incoming data block before compression.
  • the sample size (number of bytes in sample) is further referred to as “sample_size”.
  • samples will not be taken, and the similarity is not detected, if the size of the compressed data is less than a threshold, i.e., an adaptively selected similarity detection threshold, further referred to as “sd_threshold”.
  • These samples may include:
  • head_sample: a sample at the beginning of the compressed incoming block, further referred to as "head_sample". FIG. 3 shows an example of where the head_sample will be taken.
  • tail_sample: a sample at the end of the compressed incoming block, further referred to as "tail_sample". If the incoming block is non-compressible, then no tail_sample is taken. FIG. 3 shows an example of where the tail_sample will be taken.
  • body_sample: a sample taken inside the compressed data, at a position where an lz_tag encodes one of the popular_offsets.
  • the respective lz_tag should be neither the first nor the last in the encoded sequence.
  • each incoming compressed data block comprises incompressible data and compressed data
  • the data processing device is further configured to sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the beginning of the incoming compressed block is not below a first threshold, and/or when a size of the incoming compressed data block is not below a second threshold.
  • the data processing device may be further configured to sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the end of the incoming compressed block is not below a threshold and/or when a size of the incoming compressed data block is not below a second threshold.
  • the data processing device may be further configured to sample the incoming compressed data block inside of the compressed data without decompressing it, when one or more repeating offsets are encoded by the specialized similarity detection aware compression algorithm; and/or when a size of the incoming compressed data block is not below a second threshold.
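  • The sampling rules in the bullets above can be sketched as follows; sample_size, sd_threshold, and the way the literal lengths and popular-offset positions are obtained from the extra metadata are illustrative assumptions:

```python
SAMPLE_SIZE = 16    # "sample_size": bytes per sample (illustrative)
SD_THRESHOLD = 64   # "sd_threshold": minimum compressed size worth sampling (illustrative)

def take_samples(compressed, head_literal_len, tail_literal_len, body_positions):
    """Sample a compressed block without decompressing it (sketch).
    head_literal_len, tail_literal_len and body_positions would come from the
    extra metadata produced by the SDDC-aware compressor."""
    samples = {}
    if len(compressed) < SD_THRESHOLD:
        return samples                                   # too small: skip similarity detection
    if head_literal_len >= SAMPLE_SIZE:                  # enough incompressible data at the start
        samples["head_sample"] = compressed[:SAMPLE_SIZE]
    if tail_literal_len >= SAMPLE_SIZE:                  # enough incompressible data at the end
        samples["tail_sample"] = compressed[-SAMPLE_SIZE:]
    for i, pos in enumerate(body_positions):             # lz_tags that encode popular_offsets
        samples[f"body_sample_{i}"] = compressed[pos:pos + SAMPLE_SIZE]
    return samples
```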
  • the plurality of samples for each incoming compressed data block may comprise one or more of the following:
  • the data processing device may be further configured to calculate a hash for each of the plurality of samples using a hash function.
  • a hash value is calculated for each taken sample. It should be noted that the hash value calculated for the sample differs from popular_hashes found by the compression algorithm while compressing incoming blocks. According to this disclosure, either the same hash function or different hash functions can be used for different samples, i.e., head_sample, tail_sample, and body_sample hashing.
  • the data processing device may be further configured to generate a combined key for each of the plurality of samples by combining the calculated hash for that sample with one of the indications comprised in the metadata associated with the incoming compressed data block.
  • the hash value for head_sample may be combined with the indication about the first element of the compressed sequence, i.e., the head_bits.
  • the hash value for tail_sample may be combined with the indication about the last element of the compressed sequence, i.e., the tail_bits.
  • the hash value for body_sample may be combined with the indication about the one or more repeating offsets in the compressed data, i.e., respective popular_offset bits.
  • the hash values are combined with the head and tail bits using bit-wise OR.
  • for example: tail_sample_hash = (crc32(tail_sample) << 1) | tail_bits
  • the hash value is combined with popular_offset bits using XOR.
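  • A sketch of the combined-key construction described above, using zlib.crc32 as the sample hash; reserving a single low-order bit for head_bits/tail_bits is an assumption made for illustration:

```python
import zlib

def head_key(head_sample, head_bits):
    # Shift the sample hash to make room for the metadata bit, then OR the bit in.
    return (zlib.crc32(head_sample) << 1) | head_bits

def tail_key(tail_sample, tail_bits):
    return (zlib.crc32(tail_sample) << 1) | tail_bits

def body_key(body_sample, popular_offset):
    # Body-sample hashes are combined with the popular_offset bits using XOR.
    return zlib.crc32(body_sample) ^ popular_offset
```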
  • the data processing device is further configured to maintain a second key-value store keeping information about at least a part of previously obtained samples from a previously compressed data block.
  • the information of each previously obtained sample comprises at least one of: a combined key associated with that previously obtained sample, an identifier of the previously compressed data block that the previously obtained sample belongs to, one or more repeating offsets, and one or more hashes with the minimal values associated with that previously compressed data block.
  • performing the similarity detection on each compressed block in the sequence of processed data blocks 102 may comprise determining whether a compressed data block of the sequence has a partial similarity with, or is fully equal to, a previously compressed data block based on the information stored in the second key-value store.
  • Each calculated hash for a respective sample of a compressed incoming block is used to look up into a key-value store, i.e., the second key-value store.
  • the second key-value store may be implemented as a hash-table, where the key may be the hash of a sample, and the value may be a pair of the processed block ID and the popular_hashes found by the compression algorithm in the course of compression of said block.
  • the value may be denoted as <blockID, popular_hashes>.
  • the key-value pair is denoted as <head/tail/body sample hash, <blockID, popular_hashes>>.
  • the design of the second key-value store makes it possible to efficiently find one or more processed compressed blocks, containing samples with the same hash, to be possibly used as a dictionary for delta-compression of the compressed incoming block.
  • one key-value store may be used for all hash lookups, i.e., head_sample_hash, tail_sample_hash, and body_sample_hashes.
  • separate key-value stores may be used, i.e., one key-value store for look-ups using head_sample_hash, one key-value store for lookups using tail_sample_hash, and another key-value store for look-ups using body_sample_hash.
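  • A sketch of the second key-value store follows, using a single store for all sample kinds (the disclosure equally allows separate stores per sample kind); the stored value mirrors the <blockID, popular_hashes> pair described above, with an illustrative sample ID added:

```python
from collections import defaultdict

# Second key-value store (sketch): combined sample key -> list of
# (blockID, sampleID, popular_hashes) entries of already processed compressed blocks.
similarity_store = defaultdict(list)

def record_block(block_id, sample_keys, popular_hashes):
    """Update the store for a unique incoming compressed block."""
    for sample_id, key in sample_keys.items():
        similarity_store[key].append((block_id, sample_id, tuple(popular_hashes)))

def find_candidates(sample_keys):
    """Collect every (blockID, sampleID, popular_hashes) entry sharing a sample key."""
    candidates = []
    for key in sample_keys.values():
        candidates.extend(similarity_store.get(key, []))
    return candidates
```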
  • FIG. 5 shows an example of a hash table for similarity detection according to an embodiment of the disclosure.
  • the hash table comprises arrays of cell buckets, where columns are indexed by hash values.
  • a hash is calculated to produce a scalar value in the range HASH_MIN to HASH_MAX.
  • the hash value is used as an index in the hash table.
  • A hash table may be implemented as an array of cell buckets. There are many possible implementations of a hash table, and this disclosure does not limit the choice of implementation. For simplicity of explanation, a hash table comprising an array of buckets is used here.
  • Each cell in the hash table contains blockID of the compressed data block that is either stored or transmitted, and ID of a sample (head, tail, body).
  • the respective sample has a hash value equal to the index of the cell bucket.
  • each cell may contain zero or more most popular (frequent) hashes found by the compressor in the course of respective (blockID) block compression.
  • FIG. 6 shows an implementation of the hash table for similarity detection.
  • HashTable1[buckets_max][cells_max] and HashTable2[buckets_max][cells_max] may contain buckets or columns of cells.
  • a bucket is indexed by the hash value of the compressed block sample, and the bucket cell contains blockID, samplelD, and possibly other information.
  • the data processing device 100 is further configured to determine a most similar previously compressed data block for a particular incoming compressed data block based on the information stored in the second key-value store. If the most similar previously compressed data block has a partial similarity with the particular incoming compressed data block, the data processing device 100 is further configured to perform the delta-compression on that particular compressed data block using the most similar previously compressed data block as a dictionary without decompressing the most similar previously compressed data block and the particular incoming compressed data block. If the most similar previously compressed data block is fully equal to the incoming compressed data block, the data processing device 100 is further configured to perform deduplication of the particular incoming compressed data block.
  • <blockID, popular_hashes> values are sorted to count occurrences of blockIDs.
  • Each blockID is associated with its occurrence count.
  • the blockID with the maximum occurrence count is selected as a candidate for a dictionary to delta-compress the incoming compressed block.
  • the best match of popular_hashes is determined by comparing the values of popular_hashes for the incoming block with the popular_hashes associated with each of the found processed blocks. If the most similar processed compressed block is found, then it is used as a dictionary for delta-compression of the compressed incoming block.
  • the incoming compressed block may be a complete duplicate of the found similar processed compressed block. In this case, the incoming compressed block is deduplicated.
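  • The ranking step described in the bullets above can be sketched as follows; the exact scoring (occurrence count first, shared popular_hashes as a tie-breaker) is one reasonable reading for illustration, not the only possible one:

```python
from collections import Counter

def most_similar_block(candidates, incoming_popular_hashes):
    """candidates: (blockID, sampleID, popular_hashes) tuples from the second
    key-value store lookup. Returns the best dictionary candidate, or None."""
    if not candidates:
        return None
    occurrences = Counter(block_id for block_id, _, _ in candidates)
    popular_by_block = {block_id: hashes for block_id, _, hashes in candidates}

    def score(block_id):
        # Blocks matching more samples, and sharing more popular hashes, rank higher.
        shared = len(set(popular_by_block[block_id]) & set(incoming_popular_hashes))
        return (occurrences[block_id], shared)

    return max(occurrences, key=score)
```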
  • the data processing device 100 is further configured to update the second key-value store with information of obtained samples from a particular incoming compressed data block, when no similar previously compressed data block is found for that particular incoming compressed data block.
  • Key-value stores are updated with information about the unique incoming compressed block, in particular with its samples and respective extra metadata. After this update, the incoming compressed block may be found similar to other incoming compressed blocks, and therefore may be used as a dictionary for delta-compression.
  • the data processing device when the sequence of processed data blocks 102 comprises the one or more incoming incompressible data blocks, is further configured to scan each incoming incompressible data block by the specialized similarity detection aware compression algorithm, to obtain a plurality of hashes for the incoming incompressible data block; determine one or more hashes with minimal values for the incoming incompressible data block among the plurality of hashes.
  • when an LZ-compressor takes an input data block, it scans the input data sequentially to find repeating "words". Usually, a "word" is 2, 3, or 4 bytes. LZ-scanning is done by incrementing some index/pointer into the input data block and taking the next word. Assuming the input is "ABRACADABRA" and the word size is 3 bytes, the LZ-compressor takes the following words for analysis: ABR, BRA, RAC, ACA, CAD, etc.
  • the LZ-compressor usually performs hash calculation for each scanned word, then uses some sort of hash table, binary tree, etc. to find one or more hashes with minimal values. Therefore, hashes with the lowest values, or with the highest values may be easily found.
  • the data processing device 100 may be further configured to maintain a third key-value store keeping information about at least a part of previously scanned incoming incompressible data blocks.
  • the information of each previously scanned incoming incompressible data block comprises at least one of: one or more hashes with one or more minimal values, and an identifier of the previously scanned incoming incompressible data block.
  • performing the similarity detection on each incompressible data block in the sequence of processed data blocks 102 in the intermediate data structure comprises determining whether an incompressible data block of the sequence has a partial similarity with, or is fully equal to a previously scanned incompressible data block based on the information stored in the third key-value store.
  • This disclosure further proposes a third key-value store used for similarity detection among non-compressible blocks. It should be noted that non-compressible blocks will not be sampled. The one or more hashes with minimal values (the lowest-value hashes found by the LZ-class compressor and output as extra metadata), instead of samples, will be used as keys for lookups and updates of the third key-value store.
  • the data processing device 100 may be further configured to determine a most similar previously scanned incompressible data block for a particular incoming incompressible data block based on the information stored in the third key-value store. If the most similar previously scanned incompressible data block has a partial similarity with the particular incoming incompressible data block, the data processing device 100 may be further configured to perform the delta-compression on that particular incoming incompressible data block using the most similar previously scanned incompressible data block as a dictionary. If the most similar previously scanned incompressible data block is fully equal to the incoming incompressible block, the data processing device 100 may be further configured to perform deduplication of that particular incoming incompressible data block.
  • the data processing device 100 may be further configured to update the third key-value store with information about a particular incoming incompressible data block, when no similar previously scanned incompressible data block is found for that particular incoming incompressible data block.
  • FIG. 7 shows an overall procedure of the proposed device according to an embodiment of the disclosure.
  • the proposed apparatus and method involve two stages. At the first stage, incoming data blocks are compressed and then deduplicated in a compressed state. During compression of an incoming data block, extra metadata is collected, which is needed neither for decompression nor for deduplication, but rather will be used later for similarity detection at the second stage. Deduplicated incoming compressed data blocks, together with respective extra metadata, are accumulated in a batch, a queue, or a similar data structure, to be passed with a delay to the second stage. In use-cases with high data churn, when data is discarded (deleted) soon after it is sent to storage or transmitted, the delay may decrease the number of blocks passed to the second stage.
  • the second stage can be done in a separate (concurrent) process. It may accept a batch of compressed or non-compressible incoming data blocks together with respective extra metadata, accumulated in the first stage. Using the extra metadata, the similarity of an incoming compressed block to processed compressed blocks is determined by sampling, hashing, and key-value store (hash-table) lookups. The most similar processed compressed block, if any is found, is selected as a dictionary for delta-compression of the incoming compressed block. The delta-compressed incoming block is then stored or transmitted to its destination.
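  • The two stages can be summarized with the following orchestration sketch; compressor, dedup_store, similarity_index, and sink are illustrative placeholder objects, and only the control flow (compress and deduplicate, accumulate with a delay, then detect similarity and delta-compress or deduplicate) reflects the description above:

```python
def stage_one(incoming_blocks, compressor, dedup_store, batch):
    # Stage 1: compress, deduplicate in compressed form, accumulate with extra metadata.
    for block_id, raw in incoming_blocks:
        compressed, metadata = compressor.compress(raw)      # SDDC-aware compressor (placeholder)
        if not dedup_store.is_duplicate(compressed):
            dedup_store.add(block_id, compressed)
            batch.accumulate(block_id, compressed, metadata)

def stage_two(batch, similarity_index, sink):
    # Stage 2 (separate, possibly concurrent process): similarity detection on the
    # delayed batch, then delta-compression or deduplication of compressed blocks.
    for block_id, (compressed, metadata) in batch.drain():
        match = similarity_index.find_most_similar(compressed, metadata)
        if match is None:
            similarity_index.add(block_id, compressed, metadata)   # unique block
            sink.store(block_id, compressed)
        elif match.fully_equal:
            sink.store_reference(block_id, match.block_id)         # deduplicate
        else:
            delta = similarity_index.delta_compress(compressed, match.block_id)
            sink.store(block_id, delta)                            # delta-compressed block
```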
  • if no similar processed compressed block is found, the incoming compressed block is considered unique.
  • Key-value stores (hash-tables) are updated with information about the incoming compressed blocks, in particular with their samples and respective extra metadata. After this update, the incoming compressed block may be found similar to other incoming compressed blocks, and therefore may be used as a dictionary for delta-compression. The unique incoming compressed block is then stored or transmitted to its destination.
• the first stage uses a separate key-value store (hash-table) to find a possible duplicate for an incoming compressed data block among already processed compressed data blocks. If no duplicates are found, this key-value store is updated with a hash calculated for the incoming compressed block, combined with its blockID. After this update, later incoming compressed blocks may be found to be duplicates of this incoming compressed block. Further, compressed or non-compressible incoming data blocks, together with the extra metadata associated with each incoming data block, are accumulated in a batch (a queue or any other proper data structure) to be passed to the second stage of the proposed apparatus and method.
• the accumulation and the respective delay in processing make it possible to spare the computations of similarity detection and delta-compression for incoming blocks that would be deleted due to high churn.
  • the second stage can be done in a separate (concurrent) process. As input, it accepts a batch of compressed or non-compressible incoming data blocks together with respective extra metadata, accumulated in the first stage.
  • the main steps of the second stage include:
• If no similar data block is found, the incoming data block is not delta-compressed.
• Key-value stores (hash-tables) are updated with information about the incoming compressed blocks, in particular with their samples and respective extra metadata. After this update, the incoming compressed block may be found similar to other incoming compressed blocks, and therefore may be used as a dictionary for delta-compression.
• FIG. 8 shows an example of hash-table updating for an incoming block according to an embodiment of this disclosure.
• hashes h1, h2, h3, and h4 for samples taken from the incoming block b10 are obtained.
  • a most similar previously compressed data block b2 is found.
• delta-compression will be performed on b10 using b2 as a dictionary, without decompressing b2 and b10.
• the hash table will be updated with a result of the delta-compression on b10.
• If b2 is fully equal to the incoming block b10, deduplication will be performed, e.g., b10 may be replaced by a reference to the previously processed data block b2.
• the hash table will be updated with information of all obtained samples from b10.
• FIG. 9 shows a method 900 according to an embodiment of the disclosure, particularly for data processing.
  • the method 900 is performed by a data processing device 100 shown in FIG. 1.
  • the method 900 comprises a step 901 of processing a sequence of incoming data blocks 101 to obtain a sequence of processed data blocks 102 by using a compression algorithm that produces byte-aligned output.
• Each processed data block of the sequence is a compressed data block or an incompressible data block.
• the method 900 comprises a step 902 of accumulating the sequence of processed data blocks 102 into an intermediate data structure, a step 903 of performing a similarity detection on each data block in the sequence of processed data blocks 102 that is accumulated in the intermediate data structure 103, to detect whether one or more similar data blocks have been stored in or received by the data processing device. If a similar data block is found, depending on a similarity degree, the method 900 further comprises a step 904 of performing a delta-compression or deduplication on a data block in the sequence of processed data blocks 102 that corresponds to the similar data block.
  • any method according to embodiments of the disclosure may be implemented in a computer program, having code means, which when run by processing means causes the processing means to execute the steps of the method.
  • the computer program is included in a computer-readable medium of a computer program product.
  • the computer-readable medium may comprise essentially any memory, such as a ROM (Read-Only Memory), a PROM (Programmable Read-Only Memory), an EPROM (Erasable PROM), a Flash memory, an EEPROM (Electrically Erasable PROM), or a hard disk drive.
  • embodiments of the data processing device 100 comprise the necessary communication capabilities in the form of e.g., functions, means, units, elements, etc., for performing the solution.
• Examples of such means, units, elements, and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, DSPs, trellis-coded modulation (TCM) encoders, TCM decoders, power supply units, power feeders, communication interfaces, communication protocols, etc., which are suitably arranged together for performing the solution.
  • the processor(s) of the data processing device 100 may comprise, e.g., one or more instances of a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an Application Specific Integrated Circuit (ASIC), a microprocessor, or other processing logic that may interpret and execute instructions.
  • the expression “processor” may thus represent a processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above.
  • the processing circuitry may further perform data processing functions for inputting, outputting, and processing of data comprising data buffering and device control functions, such as call processing control, user interface control, or the like.

Abstract

The present disclosure relates to devices and methods for data processing. The disclosure proposes a data processing device being configured to: process a sequence of incoming data blocks to obtain a sequence of processed data blocks by using a compression algorithm that produces byte-aligned output, each processed data block being a compressed data block or an incompressible data block; accumulate the sequence of processed data blocks into an intermediate data structure; perform a similarity detection on each data block in the sequence that is accumulated in the intermediate data structure, to detect whether one or more similar data blocks have been stored in or received by the data processing device; and if a similar data block is found, depending on a similarity degree perform a delta-compression or deduplication on a data block in the sequence that corresponds to the similar data block.

Description

DEVICE AND METHOD FOR SIMILARITY DETECTION OF COMPRESSED DATA
TECHNICAL FIELD
The present disclosure generally relates to data storage, and more specifically to data reduction in data storage and transmission devices. The disclosure introduces data processing devices and methods for supporting similarity detection for compressed data, thereby improving the efficiency of these devices.
BACKGROUND
Data similarity (resemblance) detection is widely used in data storage, data transmission over a network, plagiarism detection, web search, etc. Similarity detection, when used in data storage and transmission devices, makes it possible to apply deduplication or delta-compression for similar data, thus further improving the efficiency of these devices.
To detect similarity, so-called “chunking” is used to divide data into chunks (blocks) of fixed or variable size. When chunking is defined by data content, it is called Content Defined Chunking (CDC). After chunking, blocks are tested for similarity by comparing their features, where a feature of a data block is usually a number, e.g., a hash, calculated for the content of the whole data block or a part of the data block. The latter is also done by CDC. By definition of CDC, data blocks with the same features may have similar content. The more features of two data blocks are the same, the bigger is the chance that the contents of these blocks are similar.
Popular approaches to match features of processed data blocks and incoming data blocks include key-value stores (hash-tables), Bloom filters, and sorting. For instance, the feature of a respective block, and the block, or block ID, can be used to form a key-value pair. Then the key-value store is used to store the key-value pairs of processed blocks and to find blocks with the same keys, as those calculated for incoming data blocks.
If a similar data block is found for an incoming data block, that similar data block can be used as a dictionary to delta-compress the incoming data block. In this way, only a small part of incoming data, so-called delta or difference, needs to be compressed and placed in data storage or transmitted over the network. When previously processed data block and incoming data block are fully equal, the delta or difference will be zero. In this case, data deduplication, rather than delta-compression, can be applied to the incoming data block.
Hashing effectively serves for the deduplication of fully equal data blocks after data blocks are compressed. For example, in data storage devices, an incoming data block is compressed first, then a hash value is calculated for the compressed data. This hash value is used as a key to look up in a hash table, where the hash table contains key-value pairs for already processed and compressed data blocks. In cases when there may be many duplicates in the incoming data, and/or data compression uses a lot of computations, the hash calculation and deduplication are applied to the incoming data before its compression.
Except for deduplication, all known partial similarity detection methods (which are followed by delta-compression) are applied to incoming data before its compression. Likewise, the delta-compression is applied to uncompressed incoming data. Therefore, already processed compressed data must be decompressed for the partial similarity comparison.
If and when partial similarity is detected between an already processed data block (chunk) and an incoming data block (chunk), the already processed data must also be decompressed to be used as the dictionary for delta-compression of the not yet compressed incoming data. As a result, the conventional partial similarity detection method, which uses chunking and hashing before compression of incoming data, is efficient in terms of compression ratio, but considerably degrades the latency and throughput of data storage and transmission devices, since it requires decompression of already processed data both for similarity detection and for the following delta-compression.
SUMMARY
An objective of this disclosure is thus to introduce a special compression algorithm that enables partial similarity detection for compressed data. Another objective is to combine deduplication with partial similarity detection and delta-compression, i.e., all three approaches applied to incoming data after its compression by the special compression algorithm. Another objective is to delay the partial similarity detection and delta-compression, thus sparing computations in scenarios with a high churn of incoming data, where data is discarded (deleted) soon after it is sent to storage or transmitted. These and other objectives are achieved by the solution of the present disclosure as provided in the independent claims. Advantageous implementations are further defined in the dependent claims.
A first aspect of the disclosure provides a data processing device being configured to: process a sequence of incoming data blocks to obtain a sequence of processed data blocks by using a compression algorithm that produces byte-aligned output, wherein each processed data block of the sequence is a compressed data block or an incompressible data block; accumulate the sequence of processed data blocks into an intermediate data structure; perform a similarity detection on each data block in the sequence of processed data blocks that is accumulated in the intermediate data structure, to detect whether one or more similar data blocks have been stored in or received by the data processing device; and if a similar data block is found, depending on a similarity degree perform a delta-compression or deduplication on a data block of the sequence of processed data blocks that corresponds to the similar data block.
This disclosure accordingly proposes a data processing approach that relies on specialized compression algorithms that can produce byte-aligned output for the following similarity detection and delta-compression or deduplication. By using such specialized compression algorithms, the compressor outputs data aligned on a byte boundary. The intermediate data structure may be temporary storage such as a batch, or a queue, etc. It should be noted that this disclosure enables the similarity detection and also the following delta-compression or deduplication of the compressed data - without decompression, thereby eliminating the need to decompress both already stored or transmitted data and already compressed incoming data.
In an implementation form of the first aspect, depending on a similarity degree, performing a delta-compression or deduplication on a data block comprises: performing the delta-compression on the data block, if the found similar data block is partially similar to that data block; and performing deduplication on the data block, if the found similar data block is fully equal to that data block.
“Partially similar” means that at least a part of the data block is identical to the found similar data block. In the first case, this disclosure proposes to delta-compress already compressed data. In the second case, the data block is identical to the found similar data block. The identical data block may be deduplicated, for instance, it may be replaced by a reference to the found similar (identical) data block.
In an implementation form of the first aspect, the sequence of incoming data blocks comprises at least one of, one or more incoming compressible data blocks and one or more incoming incompressible data blocks, and the compression algorithm is a specialized similarity detection aware compression algorithm, wherein processing the sequence of incoming data blocks comprises: compressing the one or more incoming compressible data blocks using the specialized similarity detection aware compression algorithm to obtain one or more incoming compressed data blocks and metadata associated with each of the one or more incoming compressed data blocks, and/or calculating metadata associated with each of the one or more incoming incompressible data blocks using the specialized similarity detection aware compression algorithm, wherein the sequence of processed data blocks comprises at least one of the one or more incoming compressed data blocks and one or more incoming incompressible data blocks, and metadata associated with each block of the sequence of the processed data blocks comprises the metadata associated with each incoming compressed data block, or the metadata associated with each incoming incompressible data block.
Optionally, this disclosure proposes to process each incoming data block using a specialized compression algorithm that is aware of the following similarity detection and possible deltacompression. For compressible data blocks, the proposed specialized compression algorithm produces compressed data in a byte-aligned format and extra metadata. For incompressible data blocks, the proposed specialized compression algorithm produces the extra metadata.
In an implementation form of the first aspect, the data processing device is further configured to: accumulate the sequence of processed data blocks and the metadata associated with each block of the sequence of the processed data blocks into the intermediate data structure; and perform the similarity detection based on the metadata associated with each block of the sequence of the processed data blocks.
Notably, the extra metadata is not a part of the compressed data and is not needed for decompression. Instead, it facilitates the following partial similarity detection and possible delta-compression. In an implementation form of the first aspect, processing the sequence of incoming data blocks further comprises: detecting whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks is a duplicate of a previously processed data block; and replacing the detected duplicate by reference to the previously processed data block.
Optionally, this disclosure proposes to combine deduplication with partial similarity detection and delta-compression. All three procedures are applied to incoming data after its compression by the specialized compression algorithm.
In an implementation form of the first aspect, the data processing device is further configured to: maintain a first key-value store keeping information of previously processed data blocks, wherein the information of each previously processed data block comprises a key and a value associated with that previously processed data block, wherein the value comprises an identifier of that previously processed data block and at least a part of the metadata of the processed data block; and detect whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks is a duplicate by using the first key-value store.
This disclosure further proposes a particular key-value store (hash table), i.e., the first key-value store, used for finding possible duplicates for incoming compressed data blocks among already processed compressed data blocks.
In an implementation form of the first aspect, the data processing device is further configured to: update the first key-value store with information of the incoming compressed data block or the incoming incompressible data block, when no duplicates are detected, by adding a key and a value associated with that incoming compressed data block or incoming incompressible data block to the first key-value store.
In an implementation form of the first aspect, the metadata associated with each of the processed data block in the sequence comprises one or more of the following:
- an indication about whether there is incompressible data at the beginning of the processed data block, and a length of the incompressible data,
- an indication about whether there is incompressible data at the end of the processed data block, a length of the incompressible data, and a position where the incompressible data starts in the processed data block,
- an indication about whether there are one or more repeating offsets, generated by using the specialized similarity detection aware compression algorithm, in the processed data block, and values of the one or more repeating offsets,
- an indication about whether there are one or more hashes with minimal values among hashes generated by the specialized similarity detection aware compression algorithm, and the one or more hashes with minimal values.
The one or more indications comprised in the metadata, will be used for the following similarity detection.
In an implementation form of the first aspect, the data processing device is further configured to: set a delay after accumulating the sequence of processed data blocks; and perform the similarity detection on the intermediate data structure after the delay expires.
Notably, in use-cases with high data churn, data may be discarded (deleted) soon after it is sent to the storage or transmitted. The delay makes it possible to decrease the number of blocks passed to the following similarity detection procedure.
In an implementation form of the first aspect, the data processing device is further configured to: update the intermediate data structure if a processed data block of the sequence is discarded or deleted before the delay expires.
In an implementation form of the first aspect, when the sequence of processed data blocks comprises the one or more incoming compressed data blocks, the data processing device is further configured to: sample each incoming compressed data block of the sequence to obtain a plurality of samples for that incoming compressed data block; calculate a hash for each of the plurality of samples using a hash function; and generate a combined key for each of the plurality of samples by combining the calculated hash for that sample with one of the indications comprised in the metadata associated with incoming compressed data block. Notably, a taken sample comprises a number of sequential bytes of the compressed data block, not the bytes of incoming data block before compression. For each taken sample, a hash value is calculated. For instance, the hash value calculated for a sample may be combined with one of the indications comprised in the metadata. Such a combined key may be used for the following similarity detection.
In an implementation form of the first aspect, the data processing device is further configured to: maintain a second key-value store keeping information about at least a part of previously obtained samples from a previously compressed data block, wherein the information of each previously obtained sample comprises at least one of: a combined key associated with that previously obtained sample, an identifier of the previously compressed data block that the previously obtained sample belongs to, one or more repeating offsets, and one or more hashes with the minimal values associated with that previously compressed data block; and wherein performing the similarity detection on each compressed block in the sequence of processed data blocks comprises determining whether a compressed data block of the sequence has a partial similarity with, or is fully equal to, a previously compressed data block based on the information stored in the second key-value store.
This disclosure further proposes another key-value store for (partial and full equality) similarity detection in compressed data, which is different from the first key-value store for deduplication of compressed data. Possibly, each of the first key-value store and the second key-value store may use different hash functions for the respective key (hash) calculations.
In an implementation form of the first aspect, the data processing device is further configured to: determine a most similar previously compressed data block for a particular incoming compressed data block based on the information stored in the second key-value store; and if the most similar previously compressed data block has a partial similarity with the particular incoming compressed data block, perform the delta-compression on that particular compressed data block using the most similar previously compressed data block as a dictionary without decompressing the most similar previously compressed data block and the particular incoming compressed data block; if the most similar previously compressed data block is fully equal to the incoming compressed data block, perform deduplication of the particular incoming compressed data block. This disclosure proposes to delta-compress already compressed data, thereby eliminating the need to decompress both dictionaries (already stored or transmitted) and already compressed incoming data.
In an implementation form of the first aspect, the data processing device is further configured to: update the second key-value store with information of obtained samples from a particular incoming compressed data block, when no similar previously compressed data block is found for that particular incoming compressed data block.
Optionally, the second key-value stores are updated with information about the unique incoming compressed block, in particular with its samples and respective extra metadata.
In an implementation form of the first aspect, each incoming compressed data block comprises incompressible data and compressed data, and the data processing device is further configured to: sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the beginning of the incoming compressed block is not below a first threshold, and/or when a size of the incoming compressed data block is not below a second threshold; or sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the end of the incoming compressed block is not below a threshold and/or when a size of the incoming compressed data block is not below a second threshold; or sample the incoming compressed data block inside of the compressed data without decompressing it, when one or more repeating offsets are encoded by the specialized similarity detection aware compression algorithm; and/or when a size of the incoming compressed data block is not below a second threshold.
In an implementation form of the first aspect, the plurality of samples for each incoming compressed data block comprises one or more of the following:
- a sample at the beginning of the incoming compressed data block,
- a sample at the end of the incoming compressed data block, and
- one or more samples inside of the compressed data of the incoming compressed data block. Notably, for an incompressible data block, the sample will not be taken.
In an implementation form of the first aspect, when the sequence of processed data blocks comprises the one or more incoming incompressible data blocks, the data processing device is further configured to: scan each incoming incompressible data block by the specialized similarity detection aware compression algorithm, to obtain a plurality of hashes for the incoming incompressible data block; determine one or more hashes with minimal values for the incoming incompressible data block among the plurality of hashes.
In an implementation form of the first aspect, the data processing device is further configured to: maintain a third key-value store keeping information about at least a part of previously scanned incoming incompressible data blocks, wherein the information of each previously scanned incoming incompressible data block comprises at least one of: one or more hashes with minimal values, and an identifier of the previously scanned incoming incompressible data block; and wherein performing the similarity detection on each incompressible data block in the sequence of processed data blocks in the intermediate data structure comprises determining whether an incompressible data block of the sequence has a partial similarity with, or is fully equal to, a previously scanned incompressible data block based on the information stored in the third key-value store.
This disclosure further proposes a third key-value store used for similarity detection among non-compressible blocks. It should be noted that non-compressible blocks will not be sampled. The one or more hashes with minimal values (the lowest-value hashes found by the LZ-class compressor and output as extra metadata), instead of samples, will be used as keys for lookups and updates of the third key-value store.
In an implementation form of the first aspect, the data processing device is further configured to: determine a most similar previously scanned incompressible data block for a particular incoming incompressible data block based on the information stored in the third key-value store; and if the most similar previously scanned incompressible data block has a partial similarity with the particular incoming incompressible data block, perform the delta-compression on that particular incoming incompressible data block using the most similar previously scanned incompressible data block as a dictionary; if the most similar previously scanned incompressible data block is fully equal to the incoming incompressible block, perform deduplication of that particular incoming incompressible data block.
Similar to the implementation for the incoming compressed data block, this disclosure proposes to delta-compress an incoming incompressible data block for which a similar incompressible data block has been found, and to deduplicate an incoming incompressible data block that is identical to a previously stored incompressible data block.
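As a purely illustrative sketch of delta-compression with a dictionary, the following Python fragment uses zlib's preset-dictionary mechanism to compress an incoming block against its most similar previously scanned block; zlib and the block sizes are assumptions of the example, not the delta-compressor of this disclosure.

    import os
    import zlib

    def delta_compress(block, dictionary):
        # zlib's preset dictionary stands in for the delta-compressor of the
        # disclosure (an assumption of this sketch): matches against the
        # dictionary mean that only the difference is effectively encoded.
        comp = zlib.compressobj(level=6, zdict=dictionary)
        return comp.compress(block) + comp.flush()

    def delta_decompress(delta, dictionary):
        decomp = zlib.decompressobj(zdict=dictionary)
        return decomp.decompress(delta) + decomp.flush()

    similar = os.urandom(4096)                            # previously scanned incompressible block
    incoming = similar[:4000] + b"small change" + similar[4000:]
    delta = delta_compress(incoming, similar)
    assert delta_decompress(delta, similar) == incoming
    print(len(incoming), len(delta))                      # the delta is much smaller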
In an implementation form of the first aspect, the data processing device is further configured to: update the third key-value store with information about a particular incoming incompressible data block, when no similar previously scanned incompressible data block is found for that particular incoming incompressible data block.
Optionally, the third key-value store is updated with information about the unique incoming incompressible data block, in particular with its hashes with minimal values and respective extra metadata.
A second aspect of the disclosure provides a method performed by a data processing device for data processing, comprising: processing a sequence of incoming data blocks to obtain a sequence of processed data blocks by using a compression algorithm that produces byte-aligned output, wherein each processed data block of the sequence is a compressed data block or an incompressible data block; accumulating the sequence of processed data blocks into an intermediate data structure; performing a similarity detection on each data block in the sequence of processed data blocks that is accumulated in the intermediate data structure, to detect whether one or more similar data blocks have been stored in or received by the data processing device; and if a similar data block is found, depending on a similarity degree performing a delta-compression or deduplication on a data block in the sequence of processed data blocks that corresponds to the similar data block.
Implementation forms of the method of the second aspect may correspond to the implementation forms of the data processing device of the first aspect described above. The method of the second aspect and its implementation forms achieve the same advantages and effects as described above for the data processing device of the first aspect and its implementation forms. A third aspect of the disclosure provides a computer program or computer program product comprising a program code for carrying out, when implemented on a processor, the method according to the second aspect and any implementation forms of the second aspect.
A fourth aspect of the disclosure provides a computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method according to the second aspect and any implementation forms of the second aspect.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above-described aspects and implementation forms of the present disclosure will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which:
FIG. 1 shows a data processing device according to an embodiment of the disclosure;
FIG. 2 shows compressed data according to an embodiment of the disclosure;
FIG. 3 shows compressed data according to an embodiment of the disclosure;
FIG. 4 shows compressed data according to an embodiment of the disclosure;
FIG. 5 shows a hash table according to an embodiment of the disclosure;
FIG. 6 shows a hash table according to an embodiment of the disclosure;
FIG. 7 shows a flow chart according to an embodiment of the disclosure;
FIG. 8 shows a similarity detection procedure according to an embodiment of the disclosure;
FIG. 9 shows a method according to an embodiment of the disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Illustrative embodiments of a data processing device and a corresponding method for supporting similarity detection on compressed data, are described with reference to the figures. Although this description provides a detailed example of possible implementations, it should be noted that the details are intended to be exemplary and in no way limit the scope of the application.
Moreover, an embodiment or example may refer to other embodiments or examples. For example, any description including but not limited to terminology, element, process, explanation, and/or technical advantage mentioned in one embodiment/example is applicative to the other embodiments or examples.
FIG. 1 shows a data processing device 100 according to an embodiment of the disclosure. The data processing device 100 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the data processing device 100 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. The data processing device 100 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under the control of the software. For instance, the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the data processing device 100 to be performed. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes data processing device 100 to perform, conduct or initiate the operations or methods described herein.
The data processing device 100 is configured to process a sequence of incoming data blocks 101 to obtain a sequence of processed data blocks 102 by using a compression algorithm that produces byte-aligned output. In particular, each processed data block of the sequence is a compressed data block or an incompressible data block.
The data processing device 100 is further configured to accumulate the sequence of processed data blocks 102 into an intermediate data structure 103. The intermediate data structure 103 may be temporary storage such as a batch, or a queue, etc. Then, the data processing device 100 is configured to perform a similarity detection on each data block in the sequence of processed data blocks 102 that is accumulated in the intermediate data structure 103, to detect whether one or more similar data blocks have been stored in or received by the data processing device 100. If a similar data block is found, depending on a similarity degree, the data processing device 100 is further configured to perform a delta-compression or deduplication on a data block in the sequence of processed data blocks 102 that corresponds to the similar data block.
This disclosure proposes a data processing approach that relies on specialized compression algorithms that can produce byte-aligned output for the following similarity detection and deltacompression or deduplication. By using such specialized compression algorithms, the compressor outputs data aligned on a byte boundary.
Optionally, whether the data processing device 100 performs a delta-compression or deduplication on a data block depends on a result of the similarity detection, such as a similarity degree if a similar data block is found. In particular, the data processing device 100 is configured to perform the delta-compression on the data block, if the found similar data block is partially similar to that data block. “Partially similar” means that at least a part of the data block is identical to the found similar data block. The data processing device 100 is also configured to perform deduplication on the data block, if the found similar data block is fully equal to that data block. In the case that the found similar data block is fully equal to a particular data block, that particular data block is identical to the found similar data block.
According to an embodiment of the disclosure, the sequence of incoming data blocks 101 may comprise at least one of one or more incoming compressible data blocks and one or more incoming incompressible data blocks. That is, each incoming data block may be a compressible data block or an incompressible data block. In addition, the compression algorithm may be a specialized similarity detection aware compression algorithm. For the one or more incoming compressible data blocks, processing the sequence of incoming data blocks 101 may comprise compressing the one or more incoming compressible data blocks using the specialized similarity detection aware compression algorithm to obtain one or more incoming compressed data blocks and metadata associated with each of the one or more incoming compressed data blocks. For the one or more incoming incompressible data blocks, processing the sequence of incoming data blocks 101 may comprise calculating metadata associated with each of the one or more incoming incompressible data blocks using the specialized similarity detection aware compression algorithm.
Accordingly, the sequence of processed data blocks 102 comprises at least one of the one or more incoming compressed data blocks and one or more incoming incompressible data blocks, and metadata associated with each block of the sequence of the processed data blocks comprises the metadata associated with each incoming compressed data block, or the metadata associated with each incoming incompressible data block.
This disclosure proposes to process each incoming data block using a byte-aligned compressed format, in particular a specialized LZ-class compression algorithm that is aware of the following similarity detection (SD) and possible Delta-compression (DC). Such a compression algorithm may be referred to as an SDDC-aware compression algorithm in this disclosure.
SDDC-aware compression may use dictionary matching, such as in LZ77, does not combine it with entropy coding, and produces compressed data in a byte-aligned format. The resulting compressed data is a finite ordered sequence of encoded elements.
FIG. 2 shows an example of compressed data after the SDDC-aware compression according to an embodiment of the disclosure. It is named "LZ compressed data" in the figure. Each element in the sequence, referred to as "lz_tuple" in this example, may start with a 24-bit tag, further referred to as "lz_tag".
Optionally, the lz_tag may comprise or encode the following entities:
- offset of repeating input bytes (repeats); if input blocks have a fixed size, for instance, 4KB, then the offset can be encoded in <= 12 bits. The other 12 bits of the lz_tag are used to encode lengths of repeats and non-repeats, as described below.
- length of encountered unique input bytes (non-repeats), e.g., it can use <= 6 bits. The number of bits to encode the non-repeat length in the lz_tag is further referred to as "non-repeat size" or "lz_non_repeat_bits". These bits are used to encode a non-repeat length <= (1 << lz_non_repeat_bits) - 2. This length is further referred to as short_non_repeat_length. For example, if lz_non_repeat_bits == 6, then short_non_repeat_length == 62.
- length of repeating input bytes (repeats), e.g., it can use <= 6 bits. The number of bits to encode the repeat length in the lz_tag is further referred to as "repeat size" or "lz_repeat_bits". These bits are used to encode a repeat length <= (1 << lz_repeat_bits) - 2. This length is further referred to as short_repeat_length. For example, if lz_repeat_bits == 6, then short_repeat_length == 62.
The lz_tag may be followed by extended lengths and then by unique bytes, as in an open-source LZ4 algorithm. There may be encoded elements with 0 unique bytes. It should be noted that the first element of the compressed sequence always encodes at least one unique byte. The last element of compressed sequence may or may not encode unique bytes.
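As an illustration of such extended lengths, the following Python sketch encodes a length that does not fit into its lz_tag field with LZ4-style continuation bytes; the exact extension rule (the maximum field value signalling an extension, 255-valued continuation bytes) is an assumption borrowed from LZ4 and is not fixed by this description.

    def encode_extended_length(length, field_bits):
        # Lengths up to (1 << field_bits) - 2 fit directly into the lz_tag
        # field; larger lengths set the field to its maximum value and append
        # LZ4-style extension bytes (255 means "add 255 and keep reading").
        limit = (1 << field_bits) - 2
        if length <= limit:
            return length, b""
        field_value = limit + 1
        rest = length - limit - 1
        ext = bytearray()
        while rest >= 255:
            ext.append(255)
            rest -= 255
        ext.append(rest)
        return field_value, bytes(ext)

    # With lz_non_repeat_bits == 6 (short_non_repeat_length == 62), a
    # non-repeat of 62 bytes needs no extension, while one of 100 bytes does.
    print(encode_extended_length(62, 6))    # (62, b'')
    print(encode_extended_length(100, 6))   # (63, b'%') -> 63 + 37 = 100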
An example of using the specialized similarity detection aware compression algorithm for processing input data is described here. For example, the input data is: “ABRACADABRA”. The compression algorithm detects repeating sub-string “ABRA” in the input string “ABRACADABRA”. Therefore, from the compression algorithm point of view, the input string comprises a non-repeat sub-string (unique sub-string of 7 letters) “ABRACAD”, followed by a repeat sub-string “ABRA” (duplicate of the first 4 letters). The offset of the repeat “ABRA” in the input string is 0.
The encoded data comprises a 24-bit lz_tag followed by the non-repeat "ABRACAD". The 24-bit lz_tag is <repeat offset == 0, non-repeat length == 7, repeat length == 4>.
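For illustration, a minimal Python sketch of packing such an lz_tuple is given below; the 12/6/6 bit split and the little-endian byte order are assumptions of the example, since the description above only fixes the 24-bit lz_tag and the entities it encodes.

    def pack_lz_tag(repeat_offset, non_repeat_len, repeat_len):
        # Hypothetical packing: 12 bits offset, 6 bits non-repeat length,
        # 6 bits repeat length, emitted as 3 bytes (byte-aligned).
        assert repeat_offset < (1 << 12)
        assert non_repeat_len < (1 << 6) and repeat_len < (1 << 6)
        tag = (repeat_offset << 12) | (non_repeat_len << 6) | repeat_len
        return tag.to_bytes(3, "little")

    def encode_lz_tuple(repeat_offset, unique_bytes, repeat_len):
        # lz_tag followed by the unique (non-repeat) bytes.
        return pack_lz_tag(repeat_offset, len(unique_bytes), repeat_len) + unique_bytes

    # "ABRACADABRA": lz_tag <offset 0, non-repeat 7, repeat 4> + "ABRACAD".
    encoded = encode_lz_tuple(0, b"ABRACAD", 4)
    print(len(encoded))   # 10 bytes: 3-byte lz_tag + 7 unique bytes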
This disclosure also proposes to collect extra metadata for each incoming data block during the compression procedure, using the SDDC-aware compression algorithm. Notably, the extra metadata is not a part of the compressed data, and the extra metadata is not needed for decompression. The extra metadata is used for facilitating the following partial similarity detection and possible delta compression in the compressed data. It may be worth mentioning that according to this disclosure, the already processed compressed data does not need to be decompressed for the partial similarity comparison and the following delta-compression. The extra metadata that is produced by the SDDC-aware compression is discussed as follows. According to an embodiment of the disclosure, the metadata associated with each of the processed data blocks in the sequence comprises one or more of the following:
- an indication about whether there is incompressible data at the beginning of the processed data block, and a length of the incompressible data,
- an indication about whether there is incompressible data at the end of the processed data block, a length of the incompressible data, and a position where the incompressible data starts in the processed data block,
- an indication about whether there are one or more repeating offsets, generated by using the specialized similarity detection aware compression algorithm, in the processed data block, and values of the one or more repeating offsets,
- an indication about whether there are one or more hashes with minimal values among hashes generated by the specialized similarity detection aware compression algorithm, and the one or more hashes with minimal values.
Notably, regarding the second indication mentioned above, the position where the incompressible data starts in the processed data block may refer to, for example, N bytes back from the last byte of the compressed data block.
Regarding the fourth indication mentioned above, for example, hash values produced by the compression algorithm may include: 212, 145, 98, 666, 32567, and 178. Among these values, there are 2 hashes with minimal values: 98 (absolute min), and 145 (next min after absolute one). According to this disclosure, two or more numbers of hashes with minimal values may be indicated by the extra metadata.
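A minimal Python sketch of selecting such minimal-value hashes is given below; keeping the two lowest distinct values is only the setting of the example above.

    import heapq

    def minimal_value_hashes(hash_values, count=2):
        # Keep the `count` lowest distinct hash values produced while the
        # compressor scans the block; they become part of the extra metadata.
        return heapq.nsmallest(count, set(hash_values))

    # Example from the text: 98 (absolute minimum) and 145 (next minimum).
    print(minimal_value_hashes([212, 145, 98, 666, 32567, 178]))  # [98, 145]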
According to another embodiment of the disclosure, the metadata associated with each of the processed data blocks in the sequence may also comprise an indication about one or more most popular hash values, which are calculated during compression. Notably, the most popular hash value means that the hash value repeats more frequently during hash calculation.
For instance, when a compressor, particularly an LZ-compressor, takes an input data block, it scans the input data sequentially to find repeating "words". Usually, a "word" is 2/3/4 bytes. LZ-scanning is done by incrementing some index/pointer into the input data block and taking the next word. Assuming the input is "ABRACADABRA" and the word size is 3 bytes, the LZ-compressor takes the following words for analysis: ABR, BRA, RAC, ACA, CAD, etc.
To find repeating words, the LZ-compressor usually performs a hash calculation for each scanned word, then uses some sort of hash table, binary tree, etc. to look up this hash and to find candidate words for a possible match (word equality). Typically, a simple hash function may be used in LZ-class algorithms, therefore many hash collisions may happen. A hash collision happens when the same hash value, e.g., 2347, is calculated for different words, e.g., ABR and CAD. A hash collision implies a repeating hash. Hashes that repeat (collide) most frequently are output by the compressor as extra metadata. It should be noted that hash collision (hash repeating) happens for any data - either compressible or non-compressible.
Due to the simple hash function, there may be more frequent hashes (repeating more times, therefore more popular) and less frequent hashes (less popular). It is also possible, though quite improbable, that all hashes collide equally often. In such a case, hashes with the lowest value or with the highest value, for example, may be considered instead of more frequent and less frequent hashes.
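A minimal Python sketch of collecting such popular (most frequently repeating) hashes during the LZ scan is given below; the 4-byte word size and the crc32 hash follow the examples in this description, while the counting itself is only an illustration.

    from collections import Counter
    from zlib import crc32

    def popular_hashes(block, word_size=4, top=1):
        # Hash every scanned word, count how often each hash value occurs
        # (collisions included), and return the most frequent values.
        counts = Counter(crc32(block[i:i + word_size])
                         for i in range(len(block) - word_size + 1))
        return [h for h, _ in counts.most_common(top)]

    # For "ABRACADABRA" with 4-byte words, the repeating word "ABRA" makes
    # its hash the most frequent one.
    print(popular_hashes(b"ABRACADABRA") == [crc32(b"ABRA")])  # True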
Optionally, the extra metadata may comprise, but is not limited to:
- 1 bit for the first element of the compressed sequence. The bit has value == 0 if the first element contains <= short_non_repeat_length unique bytes, otherwise the bit has value == 1. Alternatively, there may be more bits to represent the content of the first element of the compressed sequence. The bit or bits may be further referred to as "head_bits" of the incoming block.
- 1 bit for the last element of the compressed sequence. The bit has value == 0 if the last element contains a non-zero number of unique bytes that is <= short_non_repeat_length, otherwise the bit has value == 1. Alternatively, there may be more bits to represent the content of the last element of the compressed sequence. The bit or bits may be further referred to as "tail_bits" of the incoming block.
- zero or more most frequent (most popular) offsets in compressed data. If input data is non-compressible there are no offsets in compressed data. These popular offsets are further referred to as popular_offsets of the incoming block,
- one or more most frequent hash values, calculated during compression. Even if input data is non-compressible hash values are calculated by the compression algorithm, and due to hash collisions there may be more frequent hash values and less frequent ones. These hash values are further referred to as popular_hashes of the incoming block.
For instance, the extra metadata for the example string “ABRACADABRA” may include:
- head_bits: 1 bit for the 1st (and only) element of the compressed sequence, in this case, the head_bits has value 0;
- tail_bits: 1 bit for the last element, in this case, the tail_bits has value 0;
- popular offsets: one (and only) offset in compressed data, with value 0;
- popular_hashes: one most frequent hash, calculated during compression for the substring “ABRA”. The value of the hash depends on the hash function used by the compression algorithm. E.g., if crc32 is used, then hash value will be crc32(“ABRA”) == f8a4f73a.
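A Python sketch of a container for this extra metadata is given below; the field names mirror the terms above, and the concrete representation is an implementation choice.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ExtraMetadata:
        # Produced by the SDDC-aware compressor; not needed for
        # decompression, only for the later similarity detection.
        head_bits: int = 0                  # content class of the first element
        tail_bits: int = 0                  # content class of the last element
        popular_offsets: List[int] = field(default_factory=list)
        popular_hashes: List[int] = field(default_factory=list)

    # Extra metadata for the "ABRACADABRA" example above.
    meta = ExtraMetadata(head_bits=0, tail_bits=0,
                         popular_offsets=[0],
                         popular_hashes=[0xf8a4f73a])  # crc32("ABRA") as given in the example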
According to another embodiment of the disclosure, the data processing device may be further configured to accumulate the sequence of processed data blocks 102 and the metadata associated with each block of the sequence of the processed data blocks into the intermediate data structure 103. Then the data processing device may be further configured to perform the similarity detection based on the metadata associated with each block of the sequence of the processed data blocks.
Optionally, this disclosure proposes that the incoming data blocks are compressed and then deduplicated in the compressed state. For instance, processing the sequence of incoming data blocks 101 further comprises detecting whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks 102 is a duplicate of a previously processed data block, and then replacing the detected duplicate by reference to the previously processed data block.
This disclosure further proposes a key-value store structure (or a hash-table) used for finding possible duplicates for incoming compressed data blocks among already processed compressed data blocks.
Optionally, the data processing device 100 may be further configured to maintain a first key-value store keeping information of previously processed data blocks. In particular, the information of each previously processed data block comprises a key and a value associated with that previously processed data block. The value comprises an identifier of that previously processed data block and at least a part of the metadata of the processed data block. The data processing device 100 may be further configured to detect whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks 102 is a duplicate by using the first key-value store.
Optionally, the data processing device 100 may be further configured to update the first key-value store with information of the incoming compressed data block or the incoming incompressible data block, when no duplicates are detected, by adding a key and a value associated with that incoming compressed data block or incoming incompressible data block to the first key-value store.
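A minimal Python sketch of such a first key-value store used for deduplication in the compressed state is given below; the choice of sha256 as the block hash is an assumption of the example, as the description only requires a hash of the compressed block mapped to its blockID.

    from hashlib import sha256

    class DedupStore:
        # First key-value store: hash of the compressed block -> blockID.
        def __init__(self):
            self._table = {}

        def find_duplicate_or_add(self, block_id, compressed_bytes):
            # Returns the blockID of a previously processed duplicate, or
            # None after registering this block as new. A real implementation
            # may additionally verify byte equality on a hash match.
            key = sha256(compressed_bytes).digest()
            duplicate_id = self._table.get(key)
            if duplicate_id is not None:
                return duplicate_id      # replace the block by a reference
            self._table[key] = block_id
            return None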
This disclosure further proposes that the deduplicated incoming compressed data blocks, together with respective extra metadata, which are accumulated in the intermediate data structure 103, will be passed with a delay to the following similarity detection procedure.
Optionally, the data processing device 100 may be further configured to set a delay after accumulating the sequence of processed data blocks. Then, the data processing device 100 may be configured to perform the similarity detection on the intermediate data structure 103 after the delay expires.
Notably, in use-cases with high data churn, data may be discarded (deleted) soon after it is sent to storage or transmitted. The delay makes it possible to decrease the number of blocks passed to the following similarity detection procedure.
Optionally, the data processing device 100 may be further configured to update the intermediate data structure 103 if a processed data block of the sequence is discarded or deleted before the delay expires.
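A minimal Python sketch of such an intermediate data structure with a delay is given below; the interface names and the time source are assumptions of the example.

    import time
    from collections import OrderedDict

    class DelayedBatch:
        # Accumulates processed blocks plus metadata and releases them to the
        # similarity detection stage only after the configured delay.
        def __init__(self, delay_seconds):
            self.delay = delay_seconds
            self._pending = OrderedDict()   # block_id -> (enqueue_time, block, metadata)

        def accumulate(self, block_id, block, metadata):
            self._pending[block_id] = (time.monotonic(), block, metadata)

        def discard(self, block_id):
            # Called when a block is deleted before the delay expires
            # (high data churn): it is then never passed to the second stage.
            self._pending.pop(block_id, None)

        def take_expired(self):
            now = time.monotonic()
            ready = [bid for bid, (t, _, _) in self._pending.items()
                     if now - t >= self.delay]
            return [(bid,) + self._pending.pop(bid)[1:] for bid in ready]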
According to an embodiment of the disclosure, when the sequence of processed data blocks 102 comprises the one or more incoming compressed data blocks, the data processing device is further configured to sample each incoming compressed data block of the sequence to obtain a plurality of samples for that incoming compressed data block. For each incoming compressed data block, without its decompression, several fixed-size samples may be taken to determine its possible similarity with already processed compressed data blocks. Notably, a taken sample comprises a number of sequential bytes of the compressed data block, not the bytes of incoming data block before compression. The sample size (number of bytes in sample) is further referred to as “sample_size”. Possibly, samples will not be taken, and the similarity is not detected, if the size of the compressed data is less than a threshold, i.e., an adaptively selected similarity detection threshold, further referred to as “sd_threshold”. For example, it may be designed that “sd_threshold” >= 2 x “sample_size”.
These samples may include:
- sample at the beginning of the compressed incoming block. This sample is further referred to as “head_sample”. If the incoming block is non-compressible, then no head_sample is taken. FIG. 3 shows an example of where the head_sample will be taken.
- sample at the end of the compressed incoming block. This sample is further referred to as “tail_sample”. If the incoming block is non-compressible then no tail_sample is taken. FIG. 3 shows an example of where the tail_sample will be taken.
- optionally, one or more samples at the position of each lz_tag, if the lz_tag encodes one of the popular_offsets. The respective lz_tag should be neither the first nor the last in the encoded sequence. The sample is taken if the respective lz_tag encodes a number of unique bytes (non-repeat) <= short_non_repeat_length. This sample is further referred to as "body_sample". If the incoming block is non-compressible, then no body_sample is taken. FIG. 4 shows an example of where the body_sample will be taken.
According to an embodiment of this disclosure, each incoming compressed data block comprises incompressible data and compressed data, and the data processing device is further configured to sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the beginning of the incoming compressed block is not below a first threshold, and/or when a size of the incoming compressed data block is not below a second threshold.
Typically, at the beginning of a compressed data block, there is always a piece (a string of bytes) of incompressible data. This piece may be 1 byte or N bytes. Samples will be taken if N is bigger than the first threshold. Notably, this threshold is different from the above discussed “sd_threshold”, i.e., the second threshold, used for the overall size of the compressed data block. Optionally, the data processing device may be further configured to sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the end of the incoming compressed block is not below a threshold and/or when a size of the incoming compressed data block is not below a second threshold.
Optionally, the data processing device may be further configured to sample the incoming compressed data block inside of the compressed data without decompressing it, when one or more repeating offsets are encoded by the specialized similarity detection aware compression algorithm; and/or when a size of the incoming compressed data block is not below a second threshold.
The plurality of samples for each incoming compressed data block may comprise one or more of the following:
- a sample at the beginning of the incoming compressed data block,
- a sample at the end of the incoming compressed data block, and
- one or more samples inside of the compressed data of the incoming compressed data block.
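A minimal Python sketch of the head/tail sampling with the thresholds above is given below; the sample size and threshold values are illustrative, and body samples (taken at lz_tags encoding popular offsets) are omitted for brevity.

    SAMPLE_SIZE = 16                    # bytes per sample (illustrative)
    SD_THRESHOLD = 2 * SAMPLE_SIZE      # the "sd_threshold" described above

    def take_samples(compressed_block, is_incompressible):
        # No samples for incompressible blocks or for blocks smaller than
        # the similarity detection threshold.
        if is_incompressible or len(compressed_block) < SD_THRESHOLD:
            return {}
        return {
            "head_sample": compressed_block[:SAMPLE_SIZE],
            "tail_sample": compressed_block[-SAMPLE_SIZE:],
        }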
Possibly, the data processing device may be further configured to calculate a hash for each of the plurality of samples using a hash function.
For each taken sample, a hash value is calculated. It should be noted that the hash value calculated for the sample differs from popular_hashes found by the compression algorithm while compressing incoming blocks. According to this disclosure, either the same hash function or different hash functions can be used for different samples, i.e., head_sample, tail_sample, and body_sample hashing.
The data processing device may be further configured to generate a combined key for each of the plurality of samples by combining the calculated hash for that sample with one of the indications comprised in the metadata associated with the incoming compressed data block.
For instance, the hash value for head_sample may be combined with the indication about the first element of the compressed sequence, i.e., the head_bits. The hash value for tail_sample may be combined with the indication about the last element of the compressed sequence, i.e., the tail_bits. The hash value for body_sample may be combined with the indication about the one or more repeating offsets in the compressed data, i.e., respective popular_offset bits.
In an example, the hash values with head and tail bits are combined using bit-wise OR. Taking the hash function crc32 as an example, the results are further referred to as "head_sample_hash" and "tail_sample_hash": - head_sample_hash = (crc32(head_sample) << 1) | head_bit,
- tail_sample_hash = (crc32(tail_sample) « 1) | tail_bit.
In another example, the hash value is combined with popular_offset bits using XOR. Using the same example hash function crc32, the result is further referred to as “body sample hash”: body_sample_hash = (crc32(body_sample)) A offset_bits.
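As a minimal illustrative sketch in Python (the use of zlib.crc32 and the masking of the combined key to 32 bits are assumptions, not requirements of this disclosure), these combinations may be computed as follows:

import zlib

def head_sample_hash(head_sample, head_bit):
    # head_sample_hash = (crc32(head_sample) << 1) | head_bit
    # masking to 32 bits is an assumption to keep the combined key fixed-width
    return ((zlib.crc32(head_sample) << 1) | (head_bit & 1)) & 0xFFFFFFFF

def tail_sample_hash(tail_sample, tail_bit):
    # tail_sample_hash = (crc32(tail_sample) << 1) | tail_bit
    return ((zlib.crc32(tail_sample) << 1) | (tail_bit & 1)) & 0xFFFFFFFF

def body_sample_hash(body_sample, offset_bits):
    # body_sample_hash = crc32(body_sample) XOR popular_offset bits
    return zlib.crc32(body_sample) ^ offset_bits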
According to an embodiment of the disclosure, the data processing device is further configured to maintain a second key-value store keeping information about at least a part of previously obtained samples from a previously compressed data block. In particular, the information of each previously obtained sample comprises at least one of: a combined key associated with that previously obtained sample, an identifier of the previously compressed data block that the previously obtained sample belongs to, one or more repeating offsets, and one or more hashes with the minimal values associated with that previously compressed data block. Optionally, performing the similarity detection on each compressed block in the sequence of processed data blocks 102 may comprise determining whether a compressed data block of the sequence has a partial similarity with, or is fully equal to, a previously compressed data block based on the information stored in the second key-value store.
Each calculated hash for a respective sample of a compressed incoming block is used to look up into a key-value store, i.e., the second key-value store. The second key-value store may be implemented as a hash-table, where the key may be the hash of a sample, and the value may be a pair of the processed block ID and the popular_hashes found by the compression algorithm in the course of compression of the said block. The value may be denoted as <blockID, popular_hashes>. The key-value pair is denoted as <head/tail/body_sample_hash, <blockID, popular_hashes>>. The design of the second key-value store makes it possible to efficiently find one or more processed compressed blocks, containing samples with the same hash, to be possibly used as a dictionary for delta-compression of the compressed incoming block.
Optionally, one key-value store may be used for all hash look-ups, i.e., head_sample_hash, tail_sample_hash, and body_sample_hashes. Alternatively, separate key-value stores may be used, i.e., one key-value store for look-ups using head_sample_hash, one key-value store for look-ups using tail_sample_hash, and another key-value store for look-ups using body_sample_hash.
Look-ups into the key-value stores, using head_sample_hash, tail_sample_hash, and body_sample_hashes of the compressed incoming block as keys, result in <blockID, popular_hashes> values, where the blockID corresponds to a processed block. If no <blockID, popular_hashes> values are found, then the incoming compressed block is considered unique and it will not be delta-compressed.
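A minimal sketch of such a look-up in Python, assuming a dictionary as the second key-value store and the combined sample hashes from the previous sketch (the store layout and the function names are illustrative assumptions):

from collections import defaultdict

# second key-value store: combined sample hash -> list of <blockID, popular_hashes> values
second_kv_store = defaultdict(list)

def lookup_candidates(sample_hashes):
    """Collect the <blockID, popular_hashes> values found for all sample hashes of an incoming block."""
    candidates = []
    for h in sample_hashes:
        candidates.extend(second_kv_store.get(h, []))
    # an empty result means the incoming compressed block is considered unique
    return candidates

def register_samples(block_id, sample_hashes, popular_hashes):
    # update the store with the samples of a unique incoming compressed block
    for h in sample_hashes:
        second_kv_store[h].append((block_id, tuple(popular_hashes)))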
FIG. 5 shows an example of a hash table for similarity detection according to an embodiment of the disclosure. The hash table comprises arrays of cell buckets, where columns are indexed by hash values.
For each sample, a hash is calculated to produce a scalar value in the range HASH_MIN to HASH_MAX. The hash value is used as an index into the hash table.
A hash table may be implemented as an array of cell buckets. There are many possible implementations of a hash table, and this disclosure does not limit the choice of implementation. For simplicity of explanation, a hash table comprising an array of buckets is used here.
Each cell in the hash table contains the blockID of the compressed data block that is either stored or transmitted, and the ID of a sample (head, tail, body). The respective sample has a hash value equal to the index of the cell bucket. Besides the blockID and sampleID, each cell may contain zero or more most popular (frequent) hashes found by the compressor in the course of compression of the respective (blockID) block.
FIG. 6 shows an implementation of the hash table for similarity detection. HashTable1[buckets_max][cells_max], and HashTable2, may contain buckets or columns of cells. A bucket is indexed by the hash value of the compressed block sample, and the bucket cell contains the blockID, the sampleID, and possibly other information.
The number of data blocks stored or transferred is usually much bigger than the capacity of the hash table. Therefore, possibly not all data blocks (blockIDs) are kept in the hash table for similarity detection.
Optionally, only the most popular blocks <blockID + sampleID + popular hash> are eventually kept in the hash table. Popular bucket cells will be Moved To the Top (MTT). The least popular cells at the bucket bottom are eventually overwritten by new <blockID + sampleID + popular hash> entries. Different hash functions can be used for each hash table to level off hash table fill.
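The following Python sketch illustrates one possible bucket organization with the move-to-top policy; the number of buckets and the bucket depth are illustrative assumptions, not values required by this disclosure:

BUCKETS_MAX = 1 << 16   # number of buckets (assumption)
CELLS_MAX = 4           # cells per bucket (assumption)

# each cell is a tuple: (blockID, sampleID, popular_hashes)
hash_table = [[] for _ in range(BUCKETS_MAX)]

def find_and_promote(sample_hash, block_id=None):
    """Return the cells of the bucket indexed by sample_hash, promoting a matching cell (MTT)."""
    bucket = hash_table[sample_hash % BUCKETS_MAX]
    for i, cell in enumerate(bucket):
        if block_id is not None and cell[0] == block_id:
            bucket.insert(0, bucket.pop(i))  # Move To the Top: popular cell promoted
            break
    return list(bucket)

def insert_cell(sample_hash, block_id, sample_id, popular_hashes):
    """Insert a new cell at the top; the least popular cell at the bucket bottom is overwritten."""
    bucket = hash_table[sample_hash % BUCKETS_MAX]
    bucket.insert(0, (block_id, sample_id, tuple(popular_hashes)))
    if len(bucket) > CELLS_MAX:
        bucket.pop()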
Optionally, the data processing device 100 is further configured to determine a most similar previously compressed data block for a particular incoming compressed data block based on the information stored in the second key-value store. If the most similar previously compressed data block has a partial similarity with the particular incoming compressed data block, the data processing device 100 is further configured to perform the delta-compression on that particular compressed data block using the most similar previously compressed data block as a dictionary without decompressing the most similar previously compressed data block and the particular incoming compressed data block. If the most similar previously compressed data block is fully equal to the incoming compressed data block, the data processing device 100 is further configured to perform deduplication of the particular incoming compressed data block.
Notably, the found <blockID, popular_hashes> values are sorted to count the occurrences of the blockIDs. Each blockID is associated with its occurrence count. The blockID with the maximum occurrence count is selected as a candidate for a dictionary to delta-compress the incoming compressed block.
When no single maximum occurrence count is found, e.g., 3 different blockIDs are found with an occurrence count of 1 each, then among these the one blockID is selected whose popular_hashes best match the popular_hashes of the incoming block. The best match of popular_hashes is determined by comparing the values of the popular_hashes of the incoming block with the popular_hashes associated with each of the found processed blocks. If a most similar processed compressed block is found, then it is used as a dictionary for delta-compression of the compressed incoming block. The incoming compressed block may be a complete duplicate of the found similar processed compressed block. In this case, the incoming compressed block is deduplicated.
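A minimal sketch of this candidate selection in Python, under the assumption that the candidates are the <blockID, popular_hashes> pairs returned by the key-value store look-up:

from collections import Counter

def select_dictionary(candidates, incoming_popular_hashes):
    """candidates: list of (blockID, popular_hashes) pairs found for the incoming block."""
    if not candidates:
        return None  # unique incoming block: no dictionary, no delta-compression
    counts = Counter(block_id for block_id, _ in candidates)
    best_count = max(counts.values())
    best_ids = [bid for bid, cnt in counts.items() if cnt == best_count]
    if len(best_ids) == 1:
        return best_ids[0]  # clear winner by occurrence count
    # tie: pick the blockID whose popular_hashes best match those of the incoming block
    hashes_by_id = {bid: ph for bid, ph in candidates if bid in best_ids}
    return max(best_ids,
               key=lambda bid: len(set(hashes_by_id[bid]) & set(incoming_popular_hashes)))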
If no similar processed compressed block is found, then the incoming block is not delta-compressed.
Optionally, the data processing device 100 is further configured to update the second key-value store with information of obtained samples from a particular incoming compressed data block, when no similar previously compressed data block is found for that particular incoming compressed data block.
Key-value stores are updated with information about the unique incoming compressed block, in particular with its samples and respective extra metadata. After this update, the incoming compressed block may be found similar to other incoming compressed blocks, and therefore may be used as a dictionary for delta-compression.
According to an embodiment of this disclosure, when the sequence of processed data blocks 102 comprises the one or more incoming incompressible data blocks, the data processing device is further configured to scan each incoming incompressible data block by the specialized similarity detection aware compression algorithm, to obtain a plurality of hashes for the incoming incompressible data block; determine one or more hashes with minimal values for the incoming incompressible data block among the plurality of hashes.
As in the example previously discussed, when an LZ-compressor takes an input data block, it scans the input data sequentially to find repeating “words”. Usually, a “word” is 2/3/4 bytes. LZ-scanning is done by incrementing some index/pointer into the input data block and taking the next word. Assuming the input is “ABRACADABRA” and the word size is 3 bytes, the LZ-compressor takes the following words for analysis: ABR, BRA, RAC, ACA, CAD, etc.
An LZ-compressor usually performs a hash calculation for each scanned word, then uses some sort of hash table, binary tree, etc. to find one or more hashes with minimal values. Therefore, the hashes with the lowest values, or with the highest values, may be easily found. According to an embodiment of this disclosure, the data processing device 100 may be further configured to maintain a third key-value store keeping information about at least a part of previously scanned incoming incompressible data blocks. The information of each previously scanned incoming incompressible data block comprises at least one of: one or more hashes with one or more minimal values, and an identifier of the previously scanned incoming incompressible data block. Optionally, performing the similarity detection on each incompressible data block in the sequence of processed data blocks 102 in the intermediate data structure comprises determining whether an incompressible data block of the sequence has a partial similarity with, or is fully equal to, a previously scanned incompressible data block based on the information stored in the third key-value store.
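As a non-limiting illustrative sketch in Python (the use of crc32 as the word hash, the 3-byte word size, and the number of retained minimal hashes are all assumptions):

import heapq
import zlib

WORD_SIZE = 3    # bytes per scanned "word" (assumption)
MIN_HASHES = 4   # number of minimal-value hashes kept as extra metadata (assumption)

def minimal_hashes(block, k=MIN_HASHES):
    """Scan the block word by word, as an LZ-compressor would, and keep the k smallest word hashes."""
    word_hashes = (zlib.crc32(block[i:i + WORD_SIZE])
                   for i in range(len(block) - WORD_SIZE + 1))
    return heapq.nsmallest(k, word_hashes)

# e.g. minimal_hashes(b"ABRACADABRA") hashes the words ABR, BRA, RAC, ACA, CAD, ...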
This disclosure further proposes a third key-value store used for similarity detection among non-compressible blocks. It should be noted that non-compressible blocks will not be sampled. The hashes with one or more minimal values (the lowest-value hashes found by the LZ-class compressor and output as extra metadata) will be used, instead of samples, as keys for look-ups and updates of the third key-value store.
According to an embodiment of this disclosure, the data processing device 100 may be further configured to determine a most similar previously scanned incompressible data block for a particular incoming incompressible data block based on the information stored in the third keyvalue store. If the most similar previously scanned incompressible data block has a partial similarity with the particular incoming incompressible data block, the data processing device 100 may be further configured to perform the delta-compression on that particular incoming incompressible data block using the most similar previously scanned incompressible data block as a dictionary. If the most similar previously scanned incompressible data block is fully equal to the incoming incompressible block, the data processing device 100 may be further configured to perform deduplication of that particular incoming incompressible data block.
Optionally, the data processing device 100 may be further configured to update the third keyvalue store with information about a particular incoming incompressible data block, when no similar previously scanned incompressible data block is found for that particular incoming incompressible data block. FIG. 7 shows an overall procedure of the proposed device according to an embodiment of the disclosure.
As depicted in FIG. 7, the proposed apparatus and method involve two stages. At the first stage, incoming data blocks are compressed and then deduplicated in a compressed state. During compression of the incoming data block, extra metadata is collected, which is needed neither for decompression nor for deduplication but rather will be used later for similarity detection at the second stage. Deduplicated incoming compressed data blocks, together with the respective extra metadata, are accumulated in a batch, a queue, or a similar data structure, to be passed with a delay to the second stage. In use-cases with high data churn, when data is discarded (deleted) soon after it is sent to storage or transmitted, the delay makes it possible to decrease the number of blocks passed to the second stage.
The second stage can be done in a separate (concurrent) process. It may accept a batch of compressed or non-compressible incoming data blocks together with the respective extra metadata accumulated in the first stage. Using the extra metadata, the similarity of the incoming compressed block to processed compressed blocks is determined by sampling, hashing, and key-value store (hash-table) look-ups. The most similar processed compressed block, if any is found, is selected as a dictionary for delta-compression of the incoming compressed block. The delta-compressed incoming block is then stored or transmitted to its destination.
If no similar block is found, the incoming compressed block is considered unique. Key-value stores (hash-tables) are updated with information about the incoming compressed blocks, in particular with their samples and respective extra metadata. After this update, the incoming compressed block may be found similar to other incoming compressed blocks, and therefore may be used as a dictionary for delta-compression. The unique incoming compressed block is then stored or transmitted to its destination.
The first stage uses a separate key-value store (hash-table) to find a possible duplicate for the incoming compressed data block among already processed compressed data blocks. If no duplicate is found at this stage, the separate key-value store is updated with a hash calculated for the incoming compressed block combined with its blockID. After this update, subsequent incoming compressed blocks may be found to be duplicates of this incoming compressed block. Further, compressed or non-compressible incoming data blocks, together with the extra metadata associated with each incoming data block, are accumulated in a batch (a queue or any other suitable data structure) to be passed to the second stage of the proposed apparatus and method.
Accumulating compressed or non-compressible incoming data blocks in a batch before passing them to the second stage delays partial similarity detection and possible delta-compression. There are scenarios with high data churn when incoming blocks are deleted soon after they are passed to storage or transmitted via the network. The accumulation and the respective delay in processing make it possible to spare the computations of similarity detection and delta-compression for incoming blocks that would be deleted due to high churn.
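A minimal Python sketch of such an accumulation structure; the class name, the batch size, and the discard interface are illustrative assumptions rather than elements required by this disclosure:

from collections import deque

class Stage1Batch:
    """Accumulates processed blocks plus extra metadata before they are passed to the second stage."""
    def __init__(self, max_len=1024):
        self.queue = deque()
        self.max_len = max_len

    def push(self, block_id, block, metadata):
        self.queue.append((block_id, block, metadata))

    def discard(self, block_id):
        # high data churn: blocks deleted before the delay expires are dropped,
        # sparing similarity detection and delta-compression for them
        self.queue = deque(item for item in self.queue if item[0] != block_id)

    def ready(self):
        # the batch is passed to the second stage once it is full (or once a delay expires)
        return len(self.queue) >= self.max_len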
The second stage can be done in a separate (concurrent) process. As input, it accepts a batch of compressed or non-compressible incoming data blocks together with respective extra metadata, accumulated in the first stage.
The main steps of the second stage include:
- determining similarity of an incoming compressed block to already processed compressed blocks - to find a most similar processed compressed block, if any exists;
- possible delta-compression of the incoming compressed block using the most similar processed compressed block as a dictionary for delta-compression;
- if no similar block is found, the incoming data block is not delta-compressed. Key-value stores (hash-tables) are updated with information about the incoming compressed blocks, in particular with their samples and respective extra metadata. After this update, the incoming compressed block may be found similar to other incoming compressed blocks, and therefore may be used as a dictionary for delta-compression.
FIG. 8 shows an example of hash table updating after an incoming block according to an embodiment of this disclosure. In this example, after processing an incoming block b10 using a compression algorithm that produces byte-aligned output, hashes h1, h2, h3, and h4 for samples taken from the incoming block b10 are obtained. After comparing these hashes with a key-value store or a hash table keeping information about at least a part of previously obtained samples from a previously compressed data block, a most similar previously compressed data block b2 is found. If the most similar previously compressed data block b2 has a partial similarity with the incoming block b10, delta-compression will be performed on b10 using b2 as a dictionary without decompressing b2 and b10. In addition, the hash table will be updated with a result of the delta-compression on b10.
If b2 is fully equal to the incoming block b10, deduplication will be performed, e.g., b10 may be replaced by a reference to the previously processed data block b2.
When no similar previously compressed data block is found for b10, the hash table will be updated with information of all obtained samples from b10.
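The decision flow of FIG. 8 may be sketched as follows, in Python and reusing the earlier illustrative helpers; the helpers is_full_duplicate and delta_compress are hypothetical placeholders, not functions defined by this disclosure:

def process_incoming(block_id, compressed_block, sample_hashes, popular_hashes):
    """Delta-compress, deduplicate, or record the incoming compressed block as unique."""
    candidates = lookup_candidates(sample_hashes)               # see the earlier key-value store sketch
    similar_id = select_dictionary(candidates, popular_hashes)  # see the earlier selection sketch
    if similar_id is None:
        # unique block: update the key-value store with its samples and extra metadata
        register_samples(block_id, sample_hashes, popular_hashes)
        return ("unique", block_id)
    if is_full_duplicate(similar_id, compressed_block):          # hypothetical helper
        return ("deduplicated", similar_id)
    # partial similarity: delta-compress without decompressing either block
    return ("delta", delta_compress(compressed_block, dictionary_id=similar_id))  # hypothetical helper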
FIG. 9 shows a method 900 according to an embodiment of the disclosure. In a particular embodiment, the method 900 is performed by the data processing device 100 shown in FIG. 1. The method 900 comprises a step 901 of processing a sequence of incoming data blocks 101 to obtain a sequence of processed data blocks 102 by using a compression algorithm that produces byte-aligned output. Each processed data block of the sequence is a compressed data block or an incompressible data block.
The method 900 comprises a step 902 of accumulating the sequence of processed data blocks 102 into an intermediate data structure 103, and a step 903 of performing a similarity detection on each data block in the sequence of processed data blocks 102 that is accumulated in the intermediate data structure 103, to detect whether one or more similar data blocks have been stored in or received by the data processing device. If a similar data block is found, depending on a similarity degree, the method 900 further comprises a step 904 of performing a delta-compression or deduplication on a data block in the sequence of processed data blocks 102 that corresponds to the similar data block.
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by persons skilled in the art practicing the claimed embodiments of the disclosure, from a study of the drawings, this disclosure, and the independent claims. In the claims as well as in the description, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.
Furthermore, any method according to embodiments of the disclosure may be implemented in a computer program, having code means, which when run by processing means causes the processing means to execute the steps of the method. The computer program is included in a computer-readable medium of a computer program product. The computer-readable medium may comprise essentially any memory, such as a ROM (Read-Only Memory), a PROM (Programmable Read-Only Memory), an EPROM (Erasable PROM), a Flash memory, an EEPROM (Electrically Erasable PROM), or a hard disk drive.
Moreover, it is realized by the skilled person that embodiments of the data processing device 100 comprise the necessary communication capabilities in the form of e.g., functions, means, units, elements, etc., for performing the solution. Examples of other such means, units, elements, and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, DSPs, trellis-coded modulation (TCM) encoder, TCM decoder, power supply units, power feeders, communication interfaces, communication protocols, etc. which are suitably arranged together for performing the solution.
Especially, the processor(s) of the data processing device 100 may comprise, e.g., one or more instances of a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an Application Specific Integrated Circuit (ASIC), a microprocessor, or other processing logic that may interpret and execute instructions. The expression “processor” may thus represent a processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above. The processing circuitry may further perform data processing functions for inputting, outputting, and processing of data comprising data buffering and device control functions, such as call processing control, user interface control, or the like.

Claims

1. A data processing device (100) being configured to: process a sequence of incoming data blocks (101) to obtain a sequence of processed data blocks (102) by using a compression algorithm that produces byte-aligned output, wherein each processed data block of the sequence is a compressed data block or an incompressible data block; accumulate the sequence of processed data blocks (102) into an intermediate data structure (103); perform a similarity detection on each data block in the sequence of processed data blocks (102) that is accumulated in the intermediate data structure (103), to detect whether one or more similar data blocks have been stored in or received by the data processing device; and if a similar data block is found, depending on a similarity degree perform a delta-compression or deduplication on a data block in the sequence of processed data blocks (102) that corresponds to the similar data block.
2. The data processing device (100) according to claim 1, wherein depending on a similarity degree performing a delta-compression or deduplication on a data block comprises: performing the delta-compression on the data block, if the found similar data block is partially similar to that data block; and performing deduplication on the data block, if the found similar data block is fully equal to that data block.
3. The data processing device (100) according to claim 1 or 2, wherein the sequence of incoming data blocks (101) comprises at least one of, one or more incoming compressible data blocks and one or more incoming incompressible data blocks, and the compression algorithm is a specialized similarity detection aware compression algorithm, wherein processing the sequence of incoming data blocks (101) comprises: compressing the one or more incoming compressible data blocks using the specialized similarity detection aware compression algorithm to obtain one or more incoming compressed data blocks and metadata associated with each of the one or more incoming compressed data blocks, and/or calculating metadata associated with each of the one or more incoming incompressible data blocks using the specialized similarity detection aware compression algorithm, wherein the sequence of processed data blocks (102) comprises at least one of the one or more incoming compressed data blocks and one or more incoming incompressible data blocks, and metadata associated with each block of the sequence of the processed data blocks comprises the metadata associated with each incoming compressed data block, or the metadata associated with each incoming incompressible data block.
4. The data processing device (100) according to claim 3, configured to: accumulate the sequence of processed data blocks (102) and the metadata associated with each block of the sequence of the processed data blocks into the intermediate data structure (103); and perform the similarity detection based on the metadata associated with each block of the sequence of the processed data blocks.
5. The data processing device (100) according to claim 3 or 4, wherein processing the sequence of incoming data blocks (101) further comprises: detecting whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks (102) is a duplicate of a previously processed data block; and replacing the detected duplicate by reference to the previously processed data block.
6. The data processing device (100) according to claim 5, further configured to: maintain a first key-value store keeping information of previously processed data blocks, wherein the information of each previously processed data block comprises a key and a value associated with that previously processed data block, wherein the value comprises an identifier of that previously processed data block and at least a part of the metadata of the processed data block; and detect whether an incoming compressed data block or incoming incompressible data block of the sequence of processed data blocks (102) is a duplicate by using the first key-value store.
7. The data processing device (100) according to claim 5 or 6, further configured to: update the first key-value store with information of the incoming compressed data block or the incoming incompressible data block, when no duplicate is detected, by adding a key and a value associated with that incoming compressed data block or incoming incompressible data block to the first key-value store.
8. The data processing device (100) according to one of the claims 3 to 7, wherein the metadata associated with each of the processed data block in the sequence comprises one or more of the following:
- an indication about whether there is incompressible data at the beginning of the processed data block, and a length of the incompressible data,
- an indication about whether there is incompressible data at the end of the processed data block, a length of the incompressible data, and a position where the incompressible data starts in the processed data block,
- an indication about whether there are one or more repeating offsets, generated by using the specialized similarity detection aware compression algorithm, in the processed data block, and values of the one or more repeating offsets,
- an indication about whether there are one or more hashes with minimal values among hashes generated by the specialized similarity detection aware compression algorithm, and the one or more hashes with minimal values.
9. The data processing device (100) according to one of the claims 1 to 8, further configured to: set a delay after accumulating the sequence of processed data blocks; and perform the similarity detection on the intermediate data structure (103) after the delay expires.
10. The data processing device according to claim 9, further configured to: update the intermediate data structure (103) if a processed data block of the sequence is discarded or deleted before the delay expires.
11. The data processing device (100) according to one of the claims 3 to 10, and claim 8, wherein when the sequence of processed data blocks (102) comprises the one or more incoming compressed data blocks, the data processing device is further configured to: sample each incoming compressed data block of the sequence to obtain a plurality of samples for that incoming compressed data block; calculate a hash for each of the plurality of samples using a hash function; and generate a combined key for each of the plurality of samples by combining the calculated hash for that sample with one of the indications comprised in the metadata associated with incoming compressed data block.
12. The data processing device (100) according to claim 11, further configured to: maintain a second key-value store keeping information about at least a part of previously obtained samples from a previously compressed data block, wherein the information of each previously obtained sample comprises at least one of: a combined key associated with that previously obtained sample, an identifier of the previously compressed data block that the previously obtained sample belongs to, one or more repeating offsets, and one or more hashes with the minimal values associated with that previously compressed data block; and wherein performing the similarity detection on each compressed block in the sequence of processed data blocks (102) comprises: determining whether a compressed data block of the sequence has a partial similarity with, or is fully equal to, a previously compressed data block based on the information stored in the second key-value store.
13. The data processing device (100) according to claim 12, further configured to: determine a most similar previously compressed data block for a particular incoming compressed data block based on the information stored in the second key-value store; and if the most similar previously compressed data block has a partial similarity with the particular incoming compressed data block, perform the delta-compression on that particular compressed data block using the most similar previously compressed data block as a dictionary without decompressing the most similar previously compressed data block and the particular incoming compressed data block; if the most similar previously compressed data block is fully equal to the incoming compressed data block, perform deduplication of the particular incoming compressed data block.
14. The data processing device (100) according to claim 12, further configured to: update the second key-value store with information of obtained samples from a particular incoming compressed data block, when no similar previously compressed data block is found for that particular incoming compressed data block.
15. The data processing device (100) according to one of the claims 11 to 14, wherein each incoming compressed data block comprises incompressible data and compressed data, and the data processing device is further configured to: sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the beginning of the incoming compressed block is not below a first threshold, and/or when a size of the incoming compressed data block is not below a second threshold; or sample the incoming compressed data block without decompressing it, when a length of the incompressible data at the end of the incoming compressed block is not below a threshold and/or when a size of the incoming compressed data block is not below a second threshold; or sample the incoming compressed data block inside of the compressed data without decompressing it, when one or more repeating offsets are encoded by the specialized similarity detection aware compression algorithm; and/or when a size of the incoming compressed data block is not below a second threshold.
16. The data processing device (100) according to claim 15, wherein the plurality of samples for each incoming compressed data block comprises one or more of the following:
- a sample at the beginning of the incoming compressed data block,
- a sample at the end of the incoming compressed data block, and
- one or more samples inside of the compressed data of the incoming compressed data block.
17. The data processing device (100) according to one of the claims 3 to 10, and claim 8, wherein when the sequence of processed data blocks (102) comprises the one or more incoming incompressible data blocks, the data processing device is further configured to: scan each incoming incompressible data block by the specialized similarity detection aware compression algorithm, to obtain a plurality of hashes for the incoming incompressible data block; determine one or more hashes with minimal values for the incoming incompressible data block among the plurality of hashes.
18. The data processing device (100) according to claim 17, further configured to: maintain a third key-value store keeping information about at least a part of previously scanned incoming incompressible data blocks, wherein the information of each previously scanned incoming incompressible data block comprises at least one of: one or more hashes with minimal values; an identifier of the previously scanned incoming incompressible data block; and wherein performing the similarity detection on each incompressible data block in the sequence of processed data blocks (102) in the intermediate data structure (103) comprises: determining whether an incompressible data block of the sequence has a partial similarity with, or is fully equal to a previously scanned incompressible data block based on the information stored in the third key-value store.
19. The data processing device (100) according to claim 18, further configured to: determine a most similar previously scanned incompressible data block for a particular incoming incompressible data block based on the information stored in the third key-value store; and if the most similar previously scanned incompressible data block has a partial similarity with the particular incoming incompressible data block, perform the delta-compression on that particular incoming incompressible data block using the most similar previously scanned incompressible data block as a dictionary; if the most similar previously scanned incompressible data block is fully equal to the incoming incompressible block, perform deduplication of that particular incoming incompressible data block.
20. The data processing device (100) according to claim 18, further configured to: update the third key-value store with information about a particular incoming incompressible data block, when no similar previously scanned incompressible data block is found for that particular incoming incompressible data block.
21. Method for data processing, the method comprising: processing a sequence of incoming data blocks (101) to obtain a sequence of processed data blocks (102) by using a compression algorithm that produces byte-aligned output, wherein each processed data block of the sequence is a compressed data block or an incompressible data block; accumulating the sequence of processed data blocks (102) into an intermediate data structure (103); performing a similarity detection on each data block in the sequence of processed data blocks (102) that is accumulated in the intermediate data structure (103), to detect whether one or more similar data blocks have been stored in or received by the data processing device; and if a similar data block is found, depending on a similarity degree performing a delta-compression or deduplication on a data block in the sequence of processed data blocks (102) that corresponds to the similar data block.
22. A computer program product comprising a program code for carrying out, when implemented on a processor, the method according to claim 21.
23. A computer-readable medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of claim 21.
PCT/RU2022/000006 2022-01-11 2022-01-11 Device and method for similarity detection of compressed data WO2023136740A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/RU2022/000006 WO2023136740A1 (en) 2022-01-11 2022-01-11 Device and method for similarity detection of compressed data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU2022/000006 WO2023136740A1 (en) 2022-01-11 2022-01-11 Device and method for similarity detection of compressed data

Publications (1)

Publication Number Publication Date
WO2023136740A1 true WO2023136740A1 (en) 2023-07-20

Family

ID=80738714

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU2022/000006 WO2023136740A1 (en) 2022-01-11 2022-01-11 Device and method for similarity detection of compressed data

Country Status (1)

Country Link
WO (1) WO2023136740A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100250896A1 (en) * 2009-03-30 2010-09-30 Hi/Fn, Inc. System and method for data deduplication
US20120185612A1 (en) * 2011-01-19 2012-07-19 Exar Corporation Apparatus and method of delta compression
US20200034452A1 (en) * 2018-07-30 2020-01-30 EMC IP Holding Company LLC Dual layer deduplication for a file system running over a deduplicated block storage
WO2021190739A1 (en) * 2020-03-25 2021-09-30 Huawei Technologies Co., Ltd. Method and system of differential compression

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: "Data compression and deduplication", 30 April 2014 (2014-04-30), pages 1 - 1, XP093007889, Retrieved from the Internet <URL:https://library.netapp.com/ecmdocs/ECMP1354558/html/GUID-B0C5894F-6D20-4210-A031-D5CD39C7A029.html> [retrieved on 20221213] *
ARONOVICH L ET AL: "Similarity based deduplication with small data chunks", DISCRETE APPLIED MATHEMATICS, ELSEVIER, AMSTERDAM, NL, vol. 212, 23 October 2015 (2015-10-23), pages 10 - 22, XP029722553, ISSN: 0166-218X, DOI: 10.1016/J.DAM.2015.09.018 *
GUGIK DAVID: "Backup Compression and Deduplication: Good or Bad? Part I - Blog - Blogs - Quest Community", 5 April 2012 (2012-04-05), pages 1 - 5, XP093007910, Retrieved from the Internet <URL:https://www.quest.com/community/blogs/b/en/posts/backup-compression-and-deduplication-good-or-bad-part-i> [retrieved on 20221213] *
YEN MIAO-CHIANG ET AL: "Lightweight, Integrated Data Deduplication for Write Stress Reduction of Mobile Flash Storage", IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, IEEE, USA, vol. 37, no. 11, 1 November 2018 (2018-11-01), pages 2590 - 2600, XP011692597, ISSN: 0278-0070, [retrieved on 20181017], DOI: 10.1109/TCAD.2018.2857322 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22710198

Country of ref document: EP

Kind code of ref document: A1