CN112559452B - Data deduplication processing method, device, equipment and storage medium - Google Patents

Publication number
CN112559452B
Authority
CN
China
Prior art keywords
hash value
preset
target
target hash
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011439014.5A
Other languages
Chinese (zh)
Other versions
CN112559452A (en)
Inventor
高华龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunkuanzhiye Network Technology Co ltd
Original Assignee
Beijing Yunkuanzhiye Network Technology Co ltd
Application filed by Beijing Yunkuanzhiye Network Technology Co ltd
Priority to CN202011439014.5A
Publication of CN112559452A
Application granted
Publication of CN112559452B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks

Abstract

The application provides a data deduplication processing method, a data deduplication processing device, data deduplication equipment and a storage medium, wherein whether the historical occurrence frequency of a target hash value is larger than or equal to a preset frequency is determined through the target hash value of a target data block, the target data block is determined to be a repeated data block under the condition that the historical occurrence frequency of the target hash value of the target data block is larger than or equal to the preset frequency, and the actual storage position of the target data block in a storage system is fed back to preset equipment. In the process of searching the target data block, the historical occurrence times of the target hash value of the target data block are referred to determine whether the target data block is a repeated data block. And the storage space required for recording the historical occurrence times of the target hash value is smaller, so that the retrieval efficiency of the target data block is improved, and the application of the deduplication technology in the super-large-scale storage is promoted.

Description

Data deduplication processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data deduplication processing method, apparatus, device, and storage medium.
Background
With the development of computer technology, people generate large amounts of data in work and daily life, and this data is generally stored in a storage system. Because the storage system may contain redundant data, the duplicate data needs to be removed, i.e., deduplicated.
In the related art, data deduplication generally divides a data file into a plurality of data blocks, calculates fingerprint information for each data block, and performs a hash lookup using the fingerprint information of each data block as its key, thereby determining whether the data block is a duplicate data block. If the data block is a duplicate, only the index number of the data block is stored. If the data block is new, the data block is stored and meta-information (metadata) associated with it is created, such as its fingerprint information and storage location.
However, as data in storage systems continues to increase, so does the meta-information or metadata of data blocks. Not only is the storage space occupied by the meta-information or the meta-data continuously increased, but also the hash query efficiency is reduced, thereby limiting the application of the deduplication technology in the ultra-large scale storage.
Disclosure of Invention
The embodiments of the present application provide a data deduplication processing method, apparatus, device and storage medium to solve the problems in the related art. The technical solutions are as follows:
In a first aspect, an embodiment of the present application provides a data deduplication processing method, including:
determining whether the historical occurrence frequency of the target hash value is greater than or equal to a preset frequency or not according to the target hash value of the target data block;
determining the target data block as a repeated data block under the condition that the historical occurrence frequency of the target hash value is greater than or equal to a preset frequency;
and feeding back the actual storage position of the target data block in the storage system to a preset device.
In one embodiment, before determining whether the historical occurrence number of the target hash value is greater than or equal to a preset number according to the target hash value of the target data block, the method further includes:
receiving a position to be stored of the target data block in a storage system, wherein the position to be stored of the target data block is sent by the preset equipment;
the preset device is used for not writing the target data block in the to-be-stored position of the storage system under the condition that the to-be-stored position is different from the actual storage position.
In one embodiment, the method further comprises:
under the condition that the historical occurrence frequency of the target hash value is smaller than the preset frequency, updating the historical occurrence frequency of the target hash value to obtain the updated historical occurrence frequency of the target hash value;
determining whether a storage unit corresponding to the target hash value in a first preset storage area is full or not under the condition that the historical occurrence frequency of the target hash value after being updated is greater than or equal to a preset frequency;
and under the condition that the storage unit corresponding to the target hash value in the first preset storage area is not full, storing the target hash value and the position to be stored into the storage unit corresponding to the target hash value in the first preset storage area.
In one embodiment, in a case that a storage unit of the target hash value corresponding to the first preset storage area is not full, storing the target hash value and the to-be-stored location in the storage unit of the target hash value corresponding to the first preset storage area includes:
and under the condition that the storage unit corresponding to the target hash value in the first preset storage area is not full, storing the target hash value and the position to be stored into the storage unit corresponding to the target hash value in the first preset storage area, and deleting the record information of the target hash value in the storage unit corresponding to the second preset storage area.
In one embodiment, the method further comprises:
and returning the position to be stored to the preset equipment under the condition that the corresponding storage unit of the target hash value in the first preset storage area is full.
In one embodiment, the method further comprises:
and returning the position to be stored to the preset equipment under the condition that the historical occurrence frequency of the updated target hash value is less than the preset frequency.
In one embodiment, the first preset storage area is used for storing hash values with historical occurrence times larger than or equal to preset times and actual storage positions of data blocks corresponding to the hash values with the historical occurrence times larger than or equal to the preset times in the storage system.
In one embodiment, the method further comprises:
under the condition that the historical occurrence frequency of the target hash value is 0, determining whether a storage unit corresponding to the target hash value in a second preset storage area is full;
and under the condition that the storage unit corresponding to the target hash value in the second preset storage area is not full, recording the target hash value and the updated historical occurrence frequency of the target hash value in the storage unit corresponding to the target hash value in the second preset storage area.
In one embodiment, the method further comprises:
under the condition that the storage unit corresponding to the target hash value in the second preset storage area is full, deleting the record information corresponding to one or more hash values with the smallest historical occurrence frequency in the storage unit corresponding to the target hash value in the second preset storage area;
and recording the target hash value and the updated historical occurrence times of the target hash value in a corresponding storage unit of the target hash value in a second preset storage area.
In one embodiment, the second preset storage area is used for storing the hash value with the historical occurrence number smaller than the preset number and the historical occurrence number of the hash value with the historical occurrence number smaller than the preset number.
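The lookup flow described in the first aspect and its embodiments can be sketched as follows. This is a minimal illustrative sketch, not the implementation of the application: the names `HashIndex`, `write_block`, `preset_count`, `num_buckets` and `unit_capacity`, the use of in-memory dictionaries for the two preset storage areas, the mapping from a hash value to a storage unit, and the choice to keep counting a hash whose first-area storage unit is full are all assumptions made for illustration.

```python
class HashIndex:
    """Two-area index: frequent hashes map to locations, others to counts."""

    def __init__(self, preset_count=2, num_buckets=1024, unit_capacity=4):
        self.preset_count = preset_count
        self.num_buckets = num_buckets
        self.unit_capacity = unit_capacity
        # First preset storage area: one dict per storage unit,
        # mapping hash -> actual storage location.
        self.first_area = [dict() for _ in range(num_buckets)]
        # Second preset storage area: one dict per storage unit,
        # mapping hash -> historical occurrence count.
        self.second_area = [dict() for _ in range(num_buckets)]

    def _unit(self, h):
        # Hypothetical mapping from a hash value to its storage unit.
        return int.from_bytes(h[:8], "big") % self.num_buckets

    def lookup(self, h, pending_location):
        """Return (location the caller should use, is_duplicate)."""
        u = self._unit(h)
        hot, cold = self.first_area[u], self.second_area[u]
        if h in hot:
            # Count already reached the preset count: duplicate block.
            return hot[h], True
        count = cold.get(h, 0) + 1  # updated historical occurrence count
        if count >= self.preset_count and len(hot) < self.unit_capacity:
            # Promote: remember hash and location, drop the count record.
            hot[h] = pending_location
            cold.pop(h, None)
            return pending_location, False
        if h not in cold and len(cold) >= self.unit_capacity:
            # Second-area unit is full: evict the least frequent hash.
            del cold[min(cold, key=cold.get)]
        cold[h] = count
        return pending_location, False


def write_block(index, storage, h, block, pending_location):
    """Caller side (the 'preset device'): write only if locations match."""
    location, _ = index.lookup(h, pending_location)
    if location == pending_location:
        storage[pending_location] = block
    return location
```

With `preset_count=2`, the first two writes of an identical block are both stored, and every later write of that block is redirected to the location recorded at the second write. Only hashes that recur carry a location record, which is what keeps the index small.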
In a second aspect, an embodiment of the present application provides a data deduplication processing apparatus, where the apparatus includes:
the determining module is used for determining whether the historical occurrence frequency of the target hash value is greater than or equal to the preset frequency or not according to the target hash value of the target data block; determining the target data block as a repeated data block under the condition that the historical occurrence frequency of the target hash value is greater than or equal to a preset frequency;
and the feedback module is used for feeding back the actual storage position of the target data block in the storage system to the preset device.
In one embodiment, the apparatus further comprises: the receiving module is used for receiving the position to be stored of the target data block in the storage system, which is sent by the preset equipment, before the determining module determines whether the historical occurrence frequency of the target hash value is greater than or equal to the preset frequency according to the target hash value of the target data block;
the preset device is used for not writing the target data block in the to-be-stored position of the storage system under the condition that the to-be-stored position is different from the actual storage position.
In one embodiment, the apparatus further comprises: the device comprises an updating module and a storage module;
the updating module is used for updating the historical occurrence frequency of the target hash value under the condition that the historical occurrence frequency of the target hash value is smaller than the preset frequency to obtain the updated historical occurrence frequency of the target hash value;
the determination module is further configured to: determining whether a storage unit corresponding to the target hash value in a first preset storage area is full or not under the condition that the historical occurrence frequency of the target hash value after being updated is greater than or equal to a preset frequency;
the storage module is used for storing the target hash value and the position to be stored into the storage unit corresponding to the target hash value in the first preset storage area under the condition that the storage unit corresponding to the target hash value in the first preset storage area is not full.
In one embodiment, the apparatus further comprises: a deletion module;
the storage module is used for storing the target hash value and the position to be stored into the storage unit corresponding to the target hash value in the first preset storage area under the condition that the storage unit corresponding to the target hash value in the first preset storage area is not full;
the deleting module is used for deleting the record information of the target hash value in the corresponding storage unit in the second preset storage area.
In one embodiment, the feedback module is further configured to: and returning the position to be stored to the preset equipment under the condition that the corresponding storage unit of the target hash value in the first preset storage area is full.
In one embodiment, the feedback module is further configured to: and returning the position to be stored to the preset equipment under the condition that the historical occurrence frequency of the updated target hash value is less than the preset frequency.
In one embodiment, the first preset storage area is used for storing hash values with historical occurrence times larger than or equal to preset times and actual storage positions of data blocks corresponding to the hash values with the historical occurrence times larger than or equal to the preset times in the storage system.
In one embodiment, the determining module is further configured to: under the condition that the historical occurrence frequency of the target hash value is 0, determining whether a storage unit corresponding to the target hash value in a second preset storage area is full;
the device also includes: and the recording module is used for recording the target hash value and the updated historical occurrence frequency of the target hash value in the storage unit corresponding to the target hash value in the second preset storage area under the condition that the storage unit corresponding to the target hash value in the second preset storage area is not full.
In one embodiment, the apparatus further comprises: the deleting module is used for deleting the record information corresponding to one or more hash values with the smallest historical occurrence frequency in the storage unit corresponding to the target hash value in the second preset storage area under the condition that the storage unit corresponding to the target hash value in the second preset storage area is full;
the recording module is further configured to record the target hash value and the updated historical occurrence number of the target hash value in a storage unit corresponding to the target hash value in a second preset storage area.
In one embodiment, the second preset storage area is used for storing the hash value with the historical occurrence number smaller than the preset number and the historical occurrence number of the hash value with the historical occurrence number smaller than the preset number.
In a third aspect, an embodiment of the present application provides a data deduplication processing apparatus, where the apparatus includes: a memory and a processor. Wherein the memory and the processor are in communication with each other via an internal connection path, the memory is configured to store instructions, the processor is configured to execute the instructions stored by the memory, and the processor is configured to perform the method of any of the above aspects when the processor executes the instructions stored by the memory.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the method in any one of the above-mentioned aspects is executed.
The advantages or beneficial effects in the above technical solution at least include: determining whether the historical occurrence frequency of the target hash value is greater than or equal to a preset frequency or not through the target hash value of the target data block, determining that the target data block is a repeated data block under the condition that the historical occurrence frequency of the target hash value of the target data block is greater than or equal to the preset frequency, and feeding back the actual storage position of the target data block in the storage system to preset equipment. In the process of searching the target data block, the historical occurrence times of the target hash value of the target data block are referred to determine whether the target data block is a repeated data block. And the storage space required for recording the historical occurrence times of the target hash value is smaller, so that the retrieval efficiency of the target data block is improved, and the application of the deduplication technology in the super-large-scale storage is promoted.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a schematic diagram of a hash index apparatus according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a logical space according to an embodiment of the present application;
FIG. 3 is a flowchart of a data deduplication processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an application scenario according to another embodiment of the present application;
FIG. 5 is a flowchart of a data deduplication processing method according to another embodiment of the present application;
FIG. 6 is a flow chart of a data deduplication processing method according to another embodiment of the present application;
FIG. 7 is a flowchart of a data deduplication processing method according to another embodiment of the present application;
FIG. 8 is a block diagram of a data deduplication processing apparatus according to an embodiment of the present application;
FIG. 9 is a block diagram of a data deduplication processing apparatus according to an embodiment of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
With the development of computer technology, people generate large amounts of data in work and daily life, and this data is generally stored in a storage system. Because the storage system may contain redundant data, the duplicate data needs to be removed, i.e., deduplicated.
Data deduplication (de-duplication, or dedup for short) is currently a mainstream and very popular storage technology that can effectively optimize storage capacity. Specifically, it eliminates redundant data by deleting duplicate data in a data set and retaining only one copy. Deduplication can greatly reduce the demand for physical storage space, thereby meeting growing data storage needs. Deduplication therefore brings many practical benefits, for example:
1) Requirements on Return On Investment (ROI) or Total Cost of Ownership (TCO) are met.
2) The sharp increase of data can be effectively controlled.
3) The effective storage space is increased, and the storage efficiency is improved.
4) Saving the total storage cost and the management cost.
5) Network bandwidth for data transmission is saved.
6) The operation and maintenance costs such as space, power supply, cooling and the like are saved.
Deduplication (dedupe) technology is currently widely used in data backup and archiving systems, because repeated backups produce a large amount of duplicate data that deduplication can remove. In fact, deduplication can be used in many settings, for example in storage systems for online, near-line and offline data. It can also be implemented in a file system, a volume manager, Network Attached Storage (NAS), or a Storage Area Network (SAN). In addition, it can be used for data disaster recovery, data transmission and synchronization, and, as a data compression technique, for data packaging. Deduplication helps many applications reduce the amount of data stored, save network bandwidth, improve storage efficiency, shorten backup windows, and save cost.
Deduplication technology is measured along two dimensions: the deduplication rate (deduplication ratio) and performance. Performance depends on the specific implementation, whereas the deduplication rate is determined by the characteristics of the data itself and by the application mode; the influencing factors are shown in Table 1 below. Currently, the deduplication rates published by storage vendors range from 20:1 to 500:1.
TABLE 1
[Table 1, listing the factors that influence the deduplication rate, is an image in the original publication.]
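To make the quoted rates concrete, the short sketch below converts a deduplication rate into the fraction of physical space saved. It assumes the common convention that the rate is the ratio of data size before deduplication to data size after deduplication; the text does not define the rate, and the function names are hypothetical.

```python
def dedup_rate(logical_bytes, physical_bytes):
    """Rate N in 'N:1': size before deduplication / size after it."""
    return logical_bytes / physical_bytes


def space_saved(rate):
    """Fraction of physical space saved at a given deduplication rate."""
    return 1.0 - 1.0 / rate


# Hypothetical figures: 20 TiB of backups stored in 1 TiB of physical space.
rate = dedup_rate(20 * 2**40, 1 * 2**40)
print(f"{rate:.0f}:1 saves {space_saved(rate):.1%} of the space")
```

Under this convention, moving from 20:1 to 500:1 raises the saving only from 95% to 99.8%, so very high rates yield diminishing returns in space.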
Specifically, the implementation points of the deduplication technology may include the following aspects:
1) What data to deduplicate.
For example, deduplication may target temporal or spatial data, and global or local data. The choice of data directly determines both the implementation technique and the achievable deduplication rate. Data that changes over time, such as periodic backup and archive data, has a higher repetition rate than spatial data, which is why deduplication is widely applied in backup and archiving. In addition, global data has a higher repetition rate than local data, so deduplicating globally achieves a higher deduplication rate.
2) When to deduplicate.
Deduplication can be performed online or offline. In online mode, data is deduplicated while it is being written into the storage system. Less data is therefore actually transmitted or written, which suits storage systems that process data over a Local Area Network (LAN) or Wide Area Network (WAN), such as network backup and archiving systems or remote disaster recovery systems. Online mode must perform file segmentation, data fingerprint calculation and hash lookup in real time, so it consumes considerable system resources. In offline mode, data is first written into the storage system and deduplicated later at an appropriate time. Offline mode is the opposite of online mode: it consumes fewer system resources, but the data written contains duplicates, so additional storage space is needed to hold the data before deduplication. Offline mode is therefore suited to Direct-Attached Storage (DAS) and Storage Area Network (SAN) architectures, where data transmission does not occupy network bandwidth. Offline mode also requires a sufficient time window for the deduplication operation. When to deduplicate therefore depends on the actual storage application scenario.
3) Where to deduplicate.
Deduplication can be performed at the source or at the target. When it is performed at the source, the data transmitted is already deduplicated, which saves network bandwidth but consumes a large amount of source system resources. When it is performed at the target, the data is deduplicated only after it has been transmitted to the target, so no source system resources are consumed but a large amount of network bandwidth is used. Target-side deduplication has the advantage of being transparent to applications and interoperating well: no special Application Program Interface (API) is needed, and existing application software can be used without modification.
4) How to deduplicate.
Deduplication involves many implementation details, for example how files are split, how data block fingerprints are computed, how data blocks are retrieved, whether similar-data detection and delta encoding are used, whether the method is aware of the data content, and whether the content needs to be parsed. These details depend on the specific implementation. The present application is mainly directed to identical-data detection and performs deduplication on binary files, which gives it wider applicability.
In the deduplication process of a storage system, a data file (physical file) is generally divided into a plurality of data blocks, fingerprint information is calculated for each data block, and a hash lookup is performed using the fingerprint of each data block as its key. If identical fingerprint information is found, the data block is a duplicate, and only the index number of the data block needs to be stored. If no identical fingerprint information is found, the data block is not a duplicate, i.e., it is a new and unique data block; in this case the data block is stored and meta-information (metadata) related to it is created, such as its fingerprint information and storage location. In this way, each data file or physical file corresponds to a logical representation in the storage system, composed of the metadata of the data blocks that make up the file. To read the file, the logical representation is read first, the corresponding data blocks are then read from the storage system according to the metadata in the logical representation, and these data blocks together constitute a copy of the file. As this process shows, the key techniques of deduplication are file data block segmentation, data block fingerprint calculation, and data block retrieval. File data block segmentation, fingerprint calculation, and data block retrieval are described in turn below.
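The write and read paths just described can be sketched as a small in-memory store. This is an illustrative sketch only: the class name `BlockStore`, the fixed block size, the choice of SHA-256 as the fingerprint, and the use of dictionaries for block storage and for the logical representation are assumptions.

```python
import hashlib


class BlockStore:
    """Stores each unique block once; files are lists of fingerprints."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block bytes (unique blocks only)
        self.files = {}   # file name -> list of fingerprints (logical form)

    def write_file(self, name, data):
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            fp = hashlib.sha256(block).digest()  # fingerprint used as key
            if fp not in self.blocks:
                self.blocks[fp] = block          # new, unique block
            recipe.append(fp)                    # duplicates add a reference
        self.files[name] = recipe

    def read_file(self, name):
        # Rebuild the file from its logical representation.
        return b"".join(self.blocks[fp] for fp in self.files[name])
```

Every unique block adds one entry to `blocks`, so the index grows with the amount of unique data, which is the scaling problem the background section describes.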
1) File data block segmentation
Deduplication can be divided by granularity into file-level deduplication and data-block-level deduplication. File-level deduplication is also called Single Instance Store (SIS); block-level deduplication works at a finer granularity, which can reach 4-24 KB. Block-level deduplication clearly provides a higher deduplication rate and is therefore the predominant technique today. The main methods of splitting data into blocks are fixed-size partitioning, content-defined chunking (CDC, i.e., content-based variable-length splitting) and sliding block. The fixed-size algorithm splits a file using a predefined block size and computes, for each resulting data block, a weak checksum and a strong checksum using Message Digest Algorithm 5 (MD5). The weak checksum mainly serves to improve the performance of delta encoding: the weak checksum is calculated first and looked up in a hash table, and only if an identical weak checksum is found is the MD5 strong checksum calculated and looked up. Because the weak checksum is much cheaper to compute than the MD5 strong checksum, this effectively improves encoding performance. The fixed-size algorithm is simple and fast. However, it is very sensitive to data insertion and deletion, handles them very inefficiently, and cannot adapt to changes in content.
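The fixed-size scheme with a weak check followed by an MD5 check can be sketched as follows. The sketch makes several assumptions: Adler-32 (via `zlib.adler32`) stands in for the weak checksum, which the text does not name; the function name `dedup_fixed` and its return values are hypothetical; and strong checksums of stored blocks are computed lazily, only when a weak checksum matches.

```python
import hashlib
import zlib


def dedup_fixed(data, block_size=4096):
    """Split data into fixed-size blocks and keep one copy of each."""
    store = []         # unique blocks, in order of first appearance
    recipe = []        # for each input block, its index in store
    weak_index = {}    # weak checksum -> indices of stored blocks
    strong_cache = {}  # store index -> MD5 digest, filled lazily
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        weak = zlib.adler32(block)           # cheap check first
        match = None
        candidates = weak_index.get(weak, [])
        if candidates:
            # Weak checksum matched: confirm with the strong checksum.
            strong = hashlib.md5(block).digest()
            for i in candidates:
                if i not in strong_cache:
                    strong_cache[i] = hashlib.md5(store[i]).digest()
                if strong_cache[i] == strong:
                    match = i
                    break
        if match is None:
            store.append(block)
            match = len(store) - 1
            weak_index.setdefault(weak, []).append(match)
        recipe.append(match)
    return store, recipe
```

Prepending a single byte to the input shifts every block boundary, so blocks that were duplicates before no longer line up. This is the sensitivity to insertion and deletion noted above.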
The CDC algorithm is a variable-length chunking algorithm that uses data fingerprints, such as Rabin fingerprints, to split a file into blocks of unequal length. Unlike the fixed-size algorithm, CDC splits data blocks based on file content, so the block size varies. In operation, CDC slides a fixed-size window (e.g., 48 bytes) over the file data and computes a data fingerprint of the window contents. If the fingerprint satisfies a certain condition, for example if its value modulo a specific integer equals a preset number, the window position is taken as a block boundary. CDC has pathological cases, however: if the fingerprint condition is never satisfied, no block boundary can be determined and the data block grows too large. In practice, this is addressed by limiting the block size, for example by setting lower and upper limits. An advantage of CDC is that it is insensitive to changes in file content: inserting or deleting data affects only a few data blocks and leaves the rest unchanged. CDC also has a drawback: the block size is hard to choose. If the granularity is too fine, the overhead is too large; if it is too coarse, the deduplication effect is poor. Balancing the two is the difficulty.
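A minimal sketch of content-defined chunking with a 48-byte sliding window and size limits is given below. It uses a simple polynomial rolling hash in place of a true Rabin fingerprint, and the function name `cdc_chunks`, the constants `BASE` and `MOD`, and all default parameter values other than the 48-byte window are assumptions chosen for illustration.

```python
def cdc_chunks(data, window=48, divisor=64, remainder=0,
               min_size=64, max_size=1024):
    """Split data at content-defined boundaries.

    min_size should be at least `window` so the window is full before
    any boundary test; max_size bounds the pathological case in which
    the boundary condition is never met.
    """
    BASE = 257
    MOD = (1 << 61) - 1
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        if i >= window:
            # Remove the byte that slides out of the window.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD   # add the byte that slides in
        size = i - start + 1
        at_boundary = size >= min_size and h % divisor == remainder
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])   # trailing chunk may be short
    return chunks
```

Because the boundary test depends only on the bytes inside the window, an insertion changes boundaries near the edit, and later boundaries tend to realign with those of the unmodified data. The `divisor` sets the expected spacing between candidate boundaries and is the knob for the granularity trade-off described above.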
The sliding block algorithm combines the advantages of fixed-length partitioning and CDC partitioning, and its block size is fixed. The sliding block algorithm first computes a weak check value for a fixed-length data block and, if it matches, then computes the MD5 strong check value; if both match, the position is considered a block boundary. The data fragment preceding that data block is also treated as a data block, and its length is variable. If the sliding window moves a distance of one block size without finding a match, that position is also considered a data block boundary. The sliding block algorithm handles insertions and deletions efficiently and is able to detect more redundant data than CDC, with the disadvantage of being prone to data fragmentation.
2) Data block fingerprint computation
A data block fingerprint is an essential feature of a data block. Ideally, each data block has a unique fingerprint, i.e., different data blocks have different fingerprints. Data blocks themselves tend to be large, so the goal of a data block fingerprint is to distinguish between different data blocks using a small representation, e.g., 16, 32, 64 or 128 bytes. The data block fingerprint is usually obtained by performing a mathematical operation on the content of the data block, for example, by using a hash function such as MD5, Secure Hash Algorithm (SHA) 1, SHA-256, SHA-512, a one-way hash function (One-Way Hash), Rabin hash (Rabin Hash), and the like. In addition, many string hash functions can be used to compute a data block fingerprint. However, hash functions may suffer from collisions, i.e., different data blocks may produce the same fingerprint. Relatively speaking, hash functions such as MD5 and the SHA series have a low collision probability and are therefore commonly used for fingerprint calculation. Of these, MD5 produces a 128-bit digest and SHA-1 a 160-bit digest; SHA-X (X represents the number of bits) has a lower collision probability, but the amount of calculation increases considerably. In practical applications, a trade-off between performance and data security is required. In addition, multiple hash algorithms may be used together to compute the data block fingerprint.
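As a small illustration of fingerprint computation, the sketch below uses Python's standard `hashlib` module. The helper names `fingerprint` and `combined_fingerprint` are hypothetical; the embodiment does not prescribe a particular library or a particular way of combining algorithms.

```python
import hashlib


def fingerprint(block: bytes, algorithm: str = "sha256") -> str:
    """Return the hexadecimal fingerprint of a data block for the named algorithm."""
    return hashlib.new(algorithm, block).hexdigest()


def combined_fingerprint(block: bytes) -> str:
    """One possible way to use two algorithms together: concatenate both digests."""
    return fingerprint(block, "md5") + fingerprint(block, "sha1")


# Digest lengths in bits for the algorithms named in the text.
digest_bits = {name: hashlib.new(name).digest_size * 8
               for name in ("md5", "sha1", "sha256", "sha512")}
print(digest_bits)  # {'md5': 128, 'sha1': 160, 'sha256': 256, 'sha512': 512}

block = b"\x00" * 4096  # an all-zero 4 KB data block
print(fingerprint(block))
print(combined_fingerprint(block))
```

A longer digest lowers the collision probability at the cost of more computation and a larger index entry, which is the trade-off mentioned above.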
3) Data block retrieval
For a deduplication system with large storage capacity, the number of data blocks is very large, especially when the granularity of the data blocks is fine. Retrieval in such a large fingerprint library can therefore become a performance bottleneck. Various search structures may be used, for example, a dynamic array, a database, a red-black (RB) tree, a B-tree, a B+ tree, a B* tree, a hash table (Hash Table), etc. Among them, hash lookup or hash indexing (Hash Index) implemented on a hash table is widely adopted because of its O(1) lookup performance. Therefore, hash lookup or hash indexing may also be employed in deduplication. Because the hash table resides in memory, hash lookup or hash indexing consumes a large amount of memory, and the memory requirement needs to be planned before deduplication is performed. For example, the memory requirement can be estimated from the fingerprint length of a data block and the number of data blocks, where the number of data blocks can in turn be estimated from the storage capacity and the average data block size.
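The memory estimate described above is simple arithmetic and can be sketched as follows. The function name and the example figures (1 TiB of capacity, 8 KiB average block size, 16-byte MD5 fingerprints, and 8 bytes per entry for the storage location) are assumptions for illustration; per-entry overhead of a real hash table is ignored.

```python
def estimate_index_memory(capacity_bytes: int, avg_block_bytes: int,
                          fingerprint_bytes: int, location_bytes: int = 8) -> tuple[int, int]:
    """Return (estimated number of data blocks, estimated index size in bytes)."""
    block_count = capacity_bytes // avg_block_bytes
    return block_count, block_count * (fingerprint_bytes + location_bytes)


blocks, memory = estimate_index_memory(1 << 40, 8 << 10, 16)
print(blocks)              # 134217728 data blocks
print(memory / (1 << 30))  # 3.0 (GiB)
```

Halving the average block size doubles the number of entries, which shows why fine block granularity makes the fingerprint index expensive.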
A hash table is a data structure that is accessed directly according to a key value (Key value). A mapping function maps the key value to a position in the table at which the record is accessed, which speeds up lookup. The mapping function is also called a hash function, and the array storing the records is called a hash table. The lookup process of a hash table is basically the same as the process of constructing it: some keys can be found directly at the address produced by the hash function, while the addresses produced for other keys may collide, in which case the record must be located according to the collision handling method.
In addition, deduplication technology also involves data security. The data security here includes the following meanings: on the one hand, data block collisions and on the other hand data availability.
Specifically, the data block fingerprint (FP) is usually calculated by using a hash function such as MD5, SHA-1, SHA-256 or SHA-512. From a purely mathematical point of view, two data blocks are different if their fingerprints are different. If two data blocks have the same fingerprint, it cannot be concluded that they are the same, because a hash function can produce collisions. However, the probability of a hash collision can be very small, even lower than the probability of a disk failure, so it is generally assumed that data blocks with the same fingerprint are the same. Because fingerprint collisions remain possible, deduplication is rarely used for critical data storage, where a collision could cause a huge economic loss. There are two main solutions to this problem. The first is to perform a complete byte-level comparison of data blocks with the same fingerprint; the difficulty is that the original data of the stored data block is not readily available, and the comparison incurs a performance cost. The second is to reduce the collision probability as far as possible, that is, to adopt a stronger hash function such as SHA-512, or a combination of two or more hash algorithms, which obviously affects performance. Since the weak check value is cheaper to compute than MD5, the weak check value of the target data block may be calculated first; if it differs from the weak check value of the source data block, the MD5 check value of the target data block need not be calculated. If the weak check value of the target data block is the same as that of the source data block, the MD5 check value of the target data block may then be calculated and compared with the MD5 check value of the source data block. This approach can greatly reduce the collision probability, and with optimization the performance penalty is small.
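The weak-then-strong comparison can be sketched as a two-stage index. This is an illustrative sketch only: Adler-32 (via `zlib.adler32`) is chosen here as the weak check value, and the class name `TwoStageIndex` and its methods are hypothetical.

```python
import hashlib
import zlib


class TwoStageIndex:
    """Index blocks by weak check value first, then by MD5 strong check value."""

    def __init__(self) -> None:
        self._by_weak: dict[int, dict[bytes, int]] = {}
        self.strong_computations = 0  # counts how often MD5 was actually computed

    def _strong(self, block: bytes) -> bytes:
        self.strong_computations += 1
        return hashlib.md5(block).digest()

    def add(self, block: bytes, location: int) -> None:
        weak = zlib.adler32(block)
        self._by_weak.setdefault(weak, {})[self._strong(block)] = location

    def find(self, block: bytes):
        """Return the stored location of an identical block, or None."""
        candidates = self._by_weak.get(zlib.adler32(block))
        if candidates is None:
            return None  # weak check values differ: MD5 is never computed
        return candidates.get(self._strong(block))


index = TwoStageIndex()
index.add(b"alpha" * 100, location=0)
print(index.find(b"alpha" * 100))  # 0
print(index.find(b"beta" * 100))   # None
print(index.strong_computations)   # 2: one for add, one for the matching lookup
```

The lookup of the unrelated block stops at the weak check value, so the cost of MD5 is paid only when the cheap check already indicates a probable match.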
Usually, deduplicated data holds only a single copy of each data block. If that copy is damaged, all data files that reference it become inaccessible, so the pressure on data availability is higher than without deduplication. However, the data availability problem can be solved by conventional data protection methods, including data redundancy techniques, local backup and replication, remote backup and replication, error correction coding techniques, distributed storage techniques, and the like. Data redundancy techniques include, for example, Redundant Array of Independent Disks (RAID) 1, RAID 5, RAID 6, and the like. Error correction coding techniques include, for example, Hamming codes, the Information Dispersal Algorithm (IDA), and the like. These techniques can effectively eliminate single points of failure and thereby increase data availability, but they trade storage space for safety.
However, as data in storage systems continues to increase, so does the meta-information or metadata of data blocks. Not only is the storage space occupied by the meta-information or the meta-data continuously increased, but also the hash query efficiency is reduced, thereby limiting the application of the deduplication technology in the ultra-large scale storage. In order to solve the problem, an embodiment of the present application provides a data deduplication processing method. The method may be applied to the hash index apparatus shown in fig. 1. As shown in fig. 1, the hash index apparatus includes a sampling area processing module, a duplicate index module, a memory exchange unit reader, a logical space module, and the like. The logical space module is used for isolating the hash index device and the physical storage medium, and can also provide a continuously readable and writable logical space for the hash index device. The logical space may be a contiguous segment of storage space used to store the hash table. The storage space may be a storage space in a physical storage medium. The physical storage medium includes, but is not limited to, a Solid State Disk (SSD), a Hard Disk Drive (HDD), a magnetic Disk medium, an external storage, and the like.
A schematic diagram of the logical space is shown in fig. 2. Specifically, the logical space includes a sampling area and a duplicate hash index area. The sampling area and the duplicate hash index area may each correspond to a hash table, and the hash table may record a hash value and the location of the memory exchange unit corresponding to that hash value. That is, the key (Key) of the hash table is a hash value, and the value (value) of the hash table is the location of a memory exchange unit. The hash table may be constructed by a hash algorithm, which may specifically be a weak check value algorithm: for example, the hash value is summed as integer-type (int) values, and the sum is then taken modulo a certain prime number to obtain the location of the memory exchange unit. Because the hash value is large while the value computed from it by the hash algorithm is small, different hash values may correspond to the same memory exchange unit, i.e., one memory exchange unit may correspond to multiple hash values.
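One possible reading of this weak check value algorithm is sketched below. Interpreting the hash value as little-endian 32-bit unsigned integers is an assumption, as are the function name `locate_exchange_unit` and the prime 1021 used as the number of memory exchange units.

```python
import hashlib
import struct


def locate_exchange_unit(hash_value: bytes, unit_count: int) -> int:
    """Sum the hash value as 32-bit integers and reduce the sum modulo unit_count."""
    if len(hash_value) % 4 != 0:
        raise ValueError("hash value length must be a multiple of 4 bytes")
    words = struct.unpack(f"<{len(hash_value) // 4}I", hash_value)
    return sum(words) % unit_count


# Hand-checkable example: the words 1, 2, 3, 4 sum to 10, and 10 % 7 == 3.
print(locate_exchange_unit(struct.pack("<4I", 1, 2, 3, 4), 7))  # 3

# Hypothetical use with an MD5 hash value and 1021 memory exchange units.
unit = locate_exchange_unit(hashlib.md5(b"example block").digest(), 1021)
print(0 <= unit < 1021)  # True
```

Since far fewer units exist than possible hash values, many hash values necessarily map to the same memory exchange unit, which is why each unit holds multiple records.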
The sampling area shown in fig. 2 is used to store the hash value that has occurred and the number of occurrences of the hash value. And under the condition that the occurrence frequency of any hash value is less than the preset frequency, namely the occurrence frequency of any hash value does not reach the threshold value, storing the hash value and the occurrence frequency of the hash value in the sampling area.
The duplicate hash index area is used for storing hash values that have occurred multiple times and the location information, namely the actual storage location, of the data blocks corresponding to those hash values in the storage system. When the number of occurrences of any hash value is greater than or equal to the preset number, that is, when the number of occurrences reaches the threshold, the hash value is stored in the duplicate hash index area together with the actual storage location of the corresponding data block in the storage system, or with the to-be-stored location most recently allocated to that data block.
The memory exchange unit in the sampling area and the duplicate hash index area may be the smallest unit of a single interaction between the memory and the physical storage medium; it may be 1 sector or larger. For example, the logical space described above resides in a physical storage medium, from which a memory exchange unit can be read into the memory, e.g., any memory exchange unit in the sampling area or in the duplicate hash index area. The memory exchange unit that has been read can be modified in the memory, for example, by writing new data into it or by modifying existing data in it. The modified memory exchange unit may then be sent back to the physical storage medium, which stores it.
The hash value shown in fig. 2 may specifically be a hash value of a data block and may be calculated, for example, by an algorithm such as MD5, SHA-1 or SHA-256. The reference count corresponding to a hash value refers to the number of times the hash value has appeared in the sampling area. The storage location in the duplicate hash index area may be an actual storage location, that is, the location at which the data block corresponding to the hash value is stored in the storage system. The storage system may be the same as or different from the physical storage medium described above.
Several of the modules described in fig. 1 are described below in conjunction with fig. 2.
The sample area processing module may be specifically configured to process the hash value and the reference count in the sample area as shown in fig. 2, thereby implementing insertion, lookup, and elimination of the hash value in the sample area.
The duplicate index module is specifically configured to process the hash value and the storage location in the duplicate hash index area as shown in fig. 2, so as to implement insertion and lookup of the hash value in the duplicate hash index area.
The memory exchange unit reader includes a hash algorithm and is used for locating, reading and writing the memory exchange unit in the sampling area or the duplicate hash index area according to the hash value.
A detailed description is given below of a data deduplication processing method provided in the embodiments of the present application with reference to specific embodiments. Fig. 3 shows a flow chart of a data deduplication processing method according to an embodiment of the present application. As shown in fig. 3, the method may include:
S301, determining, according to the target hash value of the target data block, whether the historical occurrence number of the target hash value is greater than or equal to a preset number.
For example, the present embodiment may be applied to the application scenario shown in fig. 4. The application scenario includes a preset device 41, a hash index device 42, and a storage system 43. The preset device 41 may specifically be a user device. The structure of the hash index device 42 is described above, and is not described herein again. The storage system 43 is used to store data, e.g., blocks of data.
Specifically, the preset device 41 may send the target data block to the hash index device 42, and the hash index device 42 calculates the target hash value of the target data block. Alternatively, the preset device 41 may calculate the target hash value of the target data block and send it to the hash index device 42. The hash index device 42 may then determine, according to the target hash value of the target data block, whether the historical occurrence number of the target hash value is greater than or equal to the preset number. For example, the hash index device 42 may query the duplicate hash index area shown in fig. 2 for the target hash value. If the target hash value exists in the duplicate hash index area, the historical occurrence number of the target hash value is greater than or equal to the preset number. In addition, the hash index device 42 may query the sampling area shown in fig. 2 for the target hash value. If the target hash value exists in the sampling area, the historical occurrence number of the target hash value is smaller than the preset number. If the target hash value exists in neither the duplicate hash index area nor the sampling area, the historical occurrence number of the target hash value is 0, and the target data block is a newly appearing data block.
S302, determining that the target data block is a duplicate data block in a case where the historical occurrence number of the target hash value is greater than or equal to the preset number.
For example, in the case that the target hash value exists in the duplicate hash index region, the historical occurrence number of the target hash value is greater than or equal to a preset number, so that the target data block can be determined to be a duplicate data block.
S303, feeding back the actual storage location of the target data block in the storage system to a preset device.
For example, in a case where the target data block is a duplicate data block, in order to prevent the preset device 41 from writing the target data block to the storage system 43 again, the hash index device 42 may feed back the actual storage location of the target data block in the storage system 43 to the preset device 41.
In some embodiments, before determining, according to the target hash value of the target data block, whether the historical occurrence number of the target hash value is greater than or equal to the preset number, the method further includes: receiving the to-be-stored location of the target data block in the storage system, sent by the preset device. The preset device is configured not to write the target data block at the to-be-stored location in the storage system in a case where the to-be-stored location is different from the actual storage location.
For example, before the preset device 41 writes new data, such as a target data block, into the storage system 43, the preset device 41 may send the to-be-stored location of the target data block in the storage system 43 to the hash index device 42. Accordingly, the hash index device 42 may receive the to-be-stored location of the target data block in the storage system 43 sent by the preset device 41. For example, the to-be-stored location sent by the preset device 41 may be denoted as P1, and the actual storage location of the target data block in the storage system 43 may be denoted as P2. A case where P1 and P2 are different indicates a hash hit, and the preset device 41 does not need to write the target data block into the storage system 43. Specifically, before the preset device 41 sends P1 to the hash index device 42, the preset device 41 may store a mapping relationship between the logical address of the target data block and P1. In a case where the P2 received by the preset device 41 is different from P1, the preset device 41 may change the mapping relationship between the logical address of the target data block and P1 into a mapping relationship between the logical address of the target data block and P2. In addition, a case where P1 and P2 are the same indicates a hash miss, and the preset device 41 may write the target data block at location P1 in the storage system 43.
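The exchange of P1 and P2 between the preset device and the hash index device can be sketched as follows. This is a simplified illustration: the class names `HashIndex` and `PresetDevice`, the use of SHA-256, and the in-memory dictionaries are all assumptions, and the sampling area is omitted, so any hash value not found in the duplicate hash index area simply gets P1 back.

```python
import hashlib


class HashIndex:
    """Holds only a duplicate hash index: hash value -> actual storage location."""

    def __init__(self) -> None:
        self.duplicate_index: dict[bytes, int] = {}

    def query(self, target_hash: bytes, p1: int) -> int:
        # Return the actual location P2 on a hit, otherwise hand P1 back.
        return self.duplicate_index.get(target_hash, p1)


class PresetDevice:
    def __init__(self, index: HashIndex) -> None:
        self.index = index
        self.mapping: dict[int, int] = {}    # logical address -> storage location
        self.storage: dict[int, bytes] = {}  # storage location -> data block

    def write(self, logical_address: int, block: bytes, p1: int) -> bool:
        """Return True if the block was physically written, False if deduplicated."""
        self.mapping[logical_address] = p1  # provisional mapping to P1
        p2 = self.index.query(hashlib.sha256(block).digest(), p1)
        if p2 != p1:
            self.mapping[logical_address] = p2  # hash hit: remap, no write
            return False
        self.storage[p1] = block                # hash miss: write at P1
        return True


index = HashIndex()
index.duplicate_index[hashlib.sha256(b"zero").digest()] = 7
device = PresetDevice(index)
print(device.write(100, b"zero", p1=42), device.mapping[100])  # False 7
print(device.write(101, b"new", p1=43), device.mapping[101])   # True 43
```

In the first call the duplicate block is never written and its logical address is redirected to the existing location; in the second call the returned location equals P1, so the block is written there.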
In this embodiment of the application, whether the historical occurrence number of the target hash value is greater than or equal to the preset number is determined according to the target hash value of the target data block. When it is, the target data block is determined to be a duplicate data block, and the actual storage location of the target data block in the storage system is fed back to the preset device, so that the preset device does not write the target data block again at the to-be-stored location when the to-be-stored location differs from the actual storage location. That is, retrieval of the target data block does not rely entirely on the metadata or meta-information of the target data block; instead, the historical occurrence number of the target hash value is consulted to determine whether the target data block is a duplicate data block. The storage space required to record the historical occurrence number of the target hash value is smaller than the storage space occupied by the meta-information or metadata. Storage space for meta-information or metadata is therefore saved, the continuous growth of that space as data in the storage system increases is avoided, and the retrieval efficiency of the target data block is improved. In addition, when the target data block is retrieved by hash query, hash lookup efficiency can be improved, which promotes the application of deduplication technology in ultra-large-scale storage.
On the basis of the above embodiment, the method further includes the following steps as shown in fig. 5:
S501, in a case where the historical occurrence number of the target hash value is less than the preset number, updating the historical occurrence number of the target hash value to obtain the updated historical occurrence number of the target hash value.
For example, in the case where the hash index device 42 determines that the target hash value exists in the sample area, it indicates that the historical occurrence number of the target hash value is less than a preset number. In this case, the hash index apparatus 42 may update the historical occurrence count of the target hash value, for example, add 1 to the historical occurrence count of the target hash value, thereby obtaining the updated historical occurrence count of the target hash value. Further, the hash index apparatus 42 may further determine whether the historical occurrence number of the updated target hash value is greater than or equal to a preset number, that is, whether the historical occurrence number of the updated target hash value reaches a threshold.
S502, returning the to-be-stored location to the preset device in a case where the updated historical occurrence number of the target hash value is less than the preset number.
For example, in a case that the updated history occurrence number of the target hash value does not reach the threshold value, that is, is less than the preset number, the hash index apparatus 42 may return the to-be-stored location P1 to the preset device 41.
S503, determining whether the storage unit corresponding to the target hash value in the first preset storage area is full when the updated historical occurrence number of the target hash value is greater than or equal to a preset number.
For example, in a case that the updated historical occurrence number of the target hash value reaches a threshold value, that is, is greater than or equal to a preset number, the hash index apparatus 42 may further determine whether the storage unit corresponding to the target hash value in the first preset storage area is full. The first predetermined storage area may be a duplicate hash index area as described above. The storage unit of the target hash value in the first preset storage area may specifically be a memory exchange unit of the target hash value in the repeated hash index area. Specifically, the memory exchange unit corresponding to the target hash value in the duplicate hash index zone may be determined according to the target hash value and the hash table corresponding to the duplicate hash index zone as described above.
S504, in a case where the storage unit corresponding to the target hash value in the first preset storage area is not full, storing the target hash value and the to-be-stored location in the storage unit corresponding to the target hash value in the first preset storage area.
For example, the target hash value may be denoted as H. In the case that the memory exchange unit corresponding to the target hash value in the duplicate hash index zone is not full, the hash index apparatus 42 may store the target hash value H and the to-be-stored location P1 into the memory exchange unit corresponding to the target hash value in the duplicate hash index zone.
Optionally, when the storage unit of the target hash value corresponding to the first preset storage area is not full, storing the target hash value and the to-be-stored location in the storage unit of the target hash value corresponding to the first preset storage area, including: and under the condition that the storage unit corresponding to the target hash value in the first preset storage area is not full, storing the target hash value and the position to be stored into the storage unit corresponding to the target hash value in the first preset storage area, and deleting the record information of the target hash value in the storage unit corresponding to the second preset storage area.
For example, in a case that the memory exchange unit corresponding to the target hash value in the duplicate hash index area is not full, the hash index apparatus 42 may store the target hash value H and the to-be-stored location P1 in the memory exchange unit corresponding to the target hash value in the duplicate hash index area, and at the same time, the hash index apparatus 42 may delete the record information of the target hash value H in the memory exchange unit corresponding to the target hash value in the second predetermined storage area. Wherein the second preset storage area may be a sampling area as described above. The storage unit of the target hash value H in the second preset storage area may be a memory exchange unit of the target hash value H in the sampling area. The record information of the target hash value H in the corresponding memory exchange unit in the sampling area may include the target hash value H and a reference count of the target hash value H, that is, a history occurrence number.
Optionally, the first preset storage area is configured to store a hash value of which the historical occurrence number is greater than or equal to the preset number, and an actual storage location of a data block in the storage system, where the data block corresponds to the hash value of which the historical occurrence number is greater than or equal to the preset number. For example, the first predetermined storage area may be a duplicate hash index area as described above.
Optionally, the second preset storage area is configured to store the hash value with the historical occurrence number smaller than the preset number, and the historical occurrence number of the hash value with the historical occurrence number smaller than the preset number. For example, the second preset memory area may be a sampling area as described above.
S505, returning the to-be-stored location to the preset device in a case where the storage unit corresponding to the target hash value in the first preset storage area is full.
For example, in a case where the memory exchange unit corresponding to the target hash value in the duplicate hash index area is full, the hash index device 42 may return the to-be-stored location P1 to the preset device 41.
In this embodiment of the application, when the historical occurrence number of the target hash value is less than the preset number, the historical occurrence number is updated to obtain the updated historical occurrence number of the target hash value. When the updated historical occurrence number is greater than or equal to the preset number, the target data block corresponding to the target hash value is a duplicate data block. It is then determined whether the storage unit corresponding to the target hash value in the first preset storage area is full. If that storage unit is full, the to-be-stored location, rather than an actual storage location of the target data block in the storage system, is returned to the preset device. That is, when the storage unit is full, the target hash value and its storage location are not added to the storage unit, so the original content of the storage unit is not overwritten; the only consequence is that the preset device writes the target data block once more at the returned to-be-stored location. In data block-level deduplication, the hash index mainly serves to look up existing data blocks and to prevent a data block that has already been written into the storage system from being written again. If an already written data block is missed by the index, the deduplication logic is not affected; the missed data block is merely written once more in the storage system.
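Steps S501 to S505 can be sketched as a single function operating on one memory exchange unit of each area. The function name, the dictionary representation of a memory exchange unit, and the capacity and threshold values are assumptions for illustration. The sketch also assumes that the updated count is kept in the sampling area when the duplicate hash index unit is full, and that P1 is returned after S504; the text does not state either point explicitly.

```python
def on_sampled_hash(target_hash: bytes, p1: int,
                    sampling_unit: dict, duplicate_unit: dict,
                    threshold: int, capacity: int) -> int:
    """Handle a hash value already present in the sampling area; return the location."""
    count = sampling_unit[target_hash] + 1   # S501: update the occurrence number
    sampling_unit[target_hash] = count
    if count < threshold:                    # S502: still below the threshold
        return p1
    if len(duplicate_unit) < capacity:       # S503: is the index unit full?
        duplicate_unit[target_hash] = p1     # S504: promote hash value with P1
        del sampling_unit[target_hash]       # optional: drop the sampling record
        return p1
    return p1                                # S505: unit full, block is written again


sampling = {b"h": 1}
duplicates: dict = {}
on_sampled_hash(b"h", 10, sampling, duplicates, threshold=3, capacity=2)
print(sampling, duplicates)  # {b'h': 2} {}
on_sampled_hash(b"h", 11, sampling, duplicates, threshold=3, capacity=2)
print(sampling, duplicates)  # {} {b'h': 11}
```

Every branch returns P1, so the preset device writes the block in all three cases; the difference is only whether later occurrences of the same hash value will be found in the duplicate hash index area.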
On the basis of the above embodiment, the method further includes the following steps as shown in fig. 6:
S601, in a case where the historical occurrence number of the target hash value is 0, determining whether the storage unit corresponding to the target hash value in the second preset storage area is full.
For example, in a case where the hash index device 42 determines that the target hash value exists in neither the duplicate hash index area nor the sampling area, the historical occurrence number of the target hash value is 0 and the target data block is a newly appearing data block. In this case, the hash index device 42 may determine the memory exchange unit corresponding to the target hash value in the second preset storage area, for example, the sampling area; specifically, this may be determined according to the target hash value and the hash table corresponding to the sampling area. Further, the hash index device 42 may determine whether the memory exchange unit corresponding to the target hash value in the sampling area is full.
S602, when the storage unit of the target hash value corresponding to the second preset storage area is not full, recording the target hash value and the updated historical occurrence number of the target hash value in the storage unit of the target hash value corresponding to the second preset storage area.
In the case that the corresponding memory exchange unit of the target hash value in the sampling area is not full, the hash index device 42 may update the historical occurrence number of the target hash value, for example, add 1 to the historical occurrence number of the target hash value, so as to obtain the updated historical occurrence number of the target hash value, that is, 1. Further, the hash index device 42 may record the target hash value and the updated historical occurrence number 1 of the target hash value into the corresponding memory exchange unit of the target hash value in the sample area.
S603, in a case where the storage unit corresponding to the target hash value in the second preset storage area is full, deleting the record information corresponding to one or more hash values with the smallest historical occurrence number in the storage unit corresponding to the target hash value in the second preset storage area.
In the case that the memory exchange unit corresponding to the target hash value in the sampling area is full, the hash index device 42 may delete the record information corresponding to one or more hash values with the smallest historical occurrence number in the memory exchange unit.
S604, recording the target hash value and the updated historical occurrence number of the target hash value in the storage unit corresponding to the target hash value in the second preset storage area.
For example, in the case that the hash index device 42 deletes the record information corresponding to one or more hash values with the smallest historical occurrence number in the memory exchange unit, the target hash value and the updated historical occurrence number 1 of the target hash value may be further recorded in the memory exchange unit corresponding to the target hash value in the sampling area.
In the embodiment of the application, when the historical occurrence number of the target hash value is 0, it is determined whether the storage unit corresponding to the target hash value in the second preset storage area is full. In the case that the storage unit is full, the record information corresponding to one or more hash values with the smallest historical occurrence number in that storage unit is deleted. Further, the target hash value and the updated historical occurrence number of the target hash value are recorded in the storage unit corresponding to the target hash value in the second preset storage area. In most scenarios, non-repeating data blocks, e.g. data blocks that occur only once, are far more numerous than repeating data blocks, and repeating data blocks tend to be data blocks of a certain significance, e.g. all-zero data blocks, data blocks used for booting an operating system in a virtual environment, data blocks with the same specific meaning in a file system, etc. It is therefore more meaningful to store in the sampling area the record information of data blocks that occur repeatedly than that of data blocks that occur once, so in the case that the memory exchange unit corresponding to the target hash value in the sampling area is full, the record information of data blocks that occur once in the memory exchange unit can be deleted. In addition, in some embodiments, increasing the size of the sampling area can effectively avoid deleting single-occurrence data blocks from the sampling area, or prolong the time for which they are retained in the sampling area.
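The eviction behaviour of S602 to S604 can be illustrated with a minimal Python sketch. It models one memory exchange unit of the sampling area as a dictionary from hash value to historical occurrence number. The function name `record_first_occurrence`, the `capacity` parameter, and the choice to evict exactly one entry are assumptions made for illustration; the embodiment only states that "one or more" entries with the smallest count are deleted.

```python
from typing import Dict


def record_first_occurrence(unit: Dict[str, int], capacity: int, target_hash: str) -> None:
    """Record a hash whose historical occurrence number is 0 in one sampling-area unit.

    `unit` maps hash value -> historical occurrence number (hypothetical layout).
    """
    if target_hash in unit:
        raise ValueError("hash already recorded; this path handles count 0 only")
    if len(unit) >= capacity:
        # S603: the unit is full, so delete a record with the smallest count.
        # Assumption: a single victim is evicted per insertion.
        smallest = min(unit.values())
        victim = next(h for h, count in unit.items() if count == smallest)
        del unit[victim]
    # S602 / S604: record the hash with the updated count (0 + 1 = 1).
    unit[target_hash] = 1


unit = {"a": 3, "b": 1, "c": 2}
record_first_occurrence(unit, capacity=3, target_hash="d")
print(unit)  # {'a': 3, 'c': 2, 'd': 1}
```

In this run the unit is full, so the record for "b" (count 1, the smallest) is deleted before "d" is recorded with count 1, which matches the preference for keeping frequently repeated hashes.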
The data deduplication processing method is described below with reference to a specific embodiment. As shown in fig. 7, the data deduplication processing method includes the following steps:
S701, starting.
S702, receiving a to-be-stored position P1 of a target data block, wherein the target hash value of the target data block is H.
S703, judging whether H is in the repeated index area. If so, S713 is performed; otherwise, S704 is performed.
S704, judging whether H is in the sampling area. If so, S708 is performed; otherwise, S705 is performed.
S705, judging whether the memory exchange unit corresponding to H in the sampling area is full. If so, S706 is performed; otherwise, S707 is performed.
S706, deleting one or more pieces of record information with the minimum reference count in the memory exchange unit.
S707, adding a piece of record information in the memory exchange unit, wherein the record information comprises H and a reference count of 1.
S708, adding 1 to the reference count corresponding to H.
S709, judging whether the reference count corresponding to H reaches a threshold value. If so, S710 is performed; otherwise, S712 is performed.
S710, judging whether the memory exchange unit corresponding to H in the repeated index area is full. If so, S712 is performed; otherwise, S711 is performed.
S711, adding H and P1 to the memory exchange unit corresponding to H in the repeated index area, and deleting the record information of H from the memory exchange unit corresponding to H in the sampling area.
S712, returning P1.
S713, returning the storage position P2 corresponding to H in the repeated index area.
S714, ending.
Specifically, the implementation process and the specific principle of S701-S714 may refer to the content described in the foregoing embodiments, and are not described herein again.
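As an illustration only, the flow of S701 to S714 might be sketched in Python as follows. The class name `HashIndex`, the integer hash values, the modulo mapping from a hash value to a memory exchange unit, and all capacities are hypothetical. The sketch also assumes that S707 continues to S709 and that S711 continues to S712, since the step list does not state these transitions explicitly.

```python
from typing import Dict, List


class HashIndex:
    """Sketch of a hash index with a repeated index area and a sampling area."""

    def __init__(self, threshold: int = 2, unit_count: int = 4,
                 sample_capacity: int = 2, index_capacity: int = 2) -> None:
        self.threshold = threshold
        self.unit_count = unit_count
        self.sample_capacity = sample_capacity
        self.index_capacity = index_capacity
        # One dictionary per memory exchange unit (hypothetical layout).
        self.sample: List[Dict[int, int]] = [dict() for _ in range(unit_count)]
        self.index: List[Dict[int, int]] = [dict() for _ in range(unit_count)]

    def lookup(self, h: int, p1: int) -> int:
        """Return the storage position for hash H given to-be-stored position P1."""
        u = h % self.unit_count  # assumption: unit chosen by modulo
        index_unit, sample_unit = self.index[u], self.sample[u]
        if h in index_unit:                                  # S703
            return index_unit[h]                             # S713: return P2
        if h in sample_unit:                                 # S704
            sample_unit[h] += 1                              # S708
        else:
            if len(sample_unit) >= self.sample_capacity:     # S705
                victim = min(sample_unit, key=sample_unit.get)
                del sample_unit[victim]                      # S706 (one victim)
            sample_unit[h] = 1                               # S707
        # Assumption: after S707 the flow also checks the threshold (S709).
        if sample_unit[h] >= self.threshold:                 # S709
            if len(index_unit) < self.index_capacity:        # S710
                index_unit[h] = p1                           # S711: promote H
                del sample_unit[h]
        return p1                                            # S712


idx = HashIndex(threshold=2)
print([idx.lookup(0xABC, p) for p in (100, 200, 300)])  # [100, 200, 200]
```

In this run the first two occurrences return their own to-be-stored positions, the second occurrence promotes H into the repeated index area with position 200, and the third occurrence is a hit that returns 200.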
In addition, the process of S701 to S714 can be described as "a flow of acquiring a storage location from a hash value". The inputs of the flow are the hash value and the to-be-stored location, that is, the storage location to be used when the data block is not hit. The hash value may be transmitted to the hash index device 42 from outside, for example, by the preset device 41. The output of the flow is the storage location to which the data block should correspond. That is, in the process of performing deduplication processing on a data block, when processing a write Input/Output (I/O), the data may first be fragmented to obtain a data block. Further, the data block is allocated a blank storage location, such as the to-be-stored location P1. After the "flow of acquiring a storage location from a hash value" is called, if the obtained storage location P2 is different from the to-be-stored location P1, the hash is hit and the data block does not need to be written again. Further, the preset device 41 may adjust the mapping relationship between the logical address of the target data block and P1 to a mapping relationship between the logical address of the target data block and P2. If the obtained storage location is the same as the to-be-stored location P1, the hash is missed, and the preset device 41 may write the data block at the location P1 in the storage system 43.
According to the embodiment of the application, the hash index device can detect at high speed whether newly written data duplicates data that has already been written, and can retain the hash values with a high repetition rate. This avoids the problem that, in large-capacity storage, the hash index or hash table used for block-level data deduplication grows nearly linearly with the storage capacity, so that the metadata or meta-information occupies too much storage space and the index rate drops. It also prevents the hash index technique from becoming a key bottleneck that hinders the application of deduplication in very large scale storage.
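The caller side of this flow, i.e. how the preset device might use the returned storage location, can be sketched as follows. This is a simplified illustration under stated assumptions: `handle_write`, `make_stub_lookup`, the dictionary-based `mapping` and `storage`, and the use of SHA-256 are all hypothetical, and the stub lookup treats every repeated hash as a hit rather than applying the threshold logic of S701 to S714.

```python
import hashlib
from typing import Callable, Dict


def handle_write(logical_addr: int, block: bytes, p1: int,
                 lookup: Callable[[str, int], int],
                 mapping: Dict[int, int], storage: Dict[int, bytes]) -> bool:
    """Return True if the block was physically written, False if deduplicated."""
    h = hashlib.sha256(block).hexdigest()  # assumption: SHA-256 as the block hash
    location = lookup(h, p1)               # "acquire a storage location from a hash value"
    if location != p1:
        # Hit: point the logical address at the existing copy, skip the write.
        mapping[logical_addr] = location
        return False
    # Miss: write the block at the to-be-stored location P1.
    storage[p1] = block
    mapping[logical_addr] = p1
    return True


def make_stub_lookup() -> Callable[[str, int], int]:
    """Stand-in for the hash index device: first location seen for a hash wins."""
    seen: Dict[str, int] = {}

    def lookup(h: str, p1: int) -> int:
        return seen.setdefault(h, p1)

    return lookup


lookup = make_stub_lookup()
mapping: Dict[int, int] = {}
storage: Dict[int, bytes] = {}
first = handle_write(0, b"\x00" * 4096, 10, lookup, mapping, storage)
second = handle_write(1, b"\x00" * 4096, 11, lookup, mapping, storage)
print(first, second, mapping)  # True False {0: 10, 1: 10}
```

The second write of the same all-zero block is not stored again; its logical address is mapped to location 10 and location 11 stays unused.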
Fig. 8 is a block diagram illustrating a data deduplication processing apparatus according to an embodiment of the present application. The data deduplication processing apparatus may be the hash index device described above. As shown in fig. 8, the apparatus 80 may include:
a determining module 81, configured to determine, according to the target hash value of the target data block, whether the historical occurrence number of the target hash value is greater than or equal to a preset number, and to determine the target data block as a repeated data block in the case that the historical occurrence number of the target hash value is greater than or equal to the preset number;
and a feedback module 82, configured to feed back an actual storage location of the target data block in the storage system to a preset device.
Optionally, the apparatus 80 further comprises: a receiving module 83, configured to receive a to-be-stored location of a target data block in a storage system, where the to-be-stored location is sent by the preset device, before the determining module determines, according to a target hash value of the target data block, whether a historical occurrence number of the target hash value is greater than or equal to a preset number; the preset device is used for not writing the target data block in the to-be-stored position of the storage system under the condition that the to-be-stored position is different from the actual storage position.
Optionally, the apparatus 80 further comprises: an update module 84 and a storage module 85;
the updating module 84 is configured to update the historical occurrence frequency of the target hash value when the historical occurrence frequency of the target hash value is smaller than a preset frequency, so as to obtain the updated historical occurrence frequency of the target hash value;
the determining module 81 is further configured to: determining whether a storage unit corresponding to the target hash value in a first preset storage area is full or not under the condition that the historical occurrence frequency of the target hash value after being updated is greater than or equal to a preset frequency;
the storage module 85 is configured to store the target hash value and the to-be-stored location in the storage unit corresponding to the target hash value in the first preset storage area when the storage unit corresponding to the target hash value in the first preset storage area is not full.
Optionally, the apparatus 80 further comprises: a delete module 86; the storage module 85 is configured to store the target hash value and the to-be-stored location in the storage unit corresponding to the target hash value in the first preset storage area when the storage unit corresponding to the target hash value in the first preset storage area is not full; the deleting module 86 is configured to delete the recorded information of the target hash value in the corresponding storage unit in the second preset storage area.
Optionally, the feedback module 82 is further configured to: and returning the position to be stored to the preset equipment under the condition that the corresponding storage unit of the target hash value in the first preset storage area is full.
Optionally, the feedback module 82 is further configured to: and returning the position to be stored to the preset equipment under the condition that the historical occurrence frequency of the updated target hash value is less than the preset frequency.
Optionally, the first preset storage area is configured to store a hash value of which the historical occurrence number is greater than or equal to the preset number, and an actual storage location of a data block in the storage system, where the data block corresponds to the hash value of which the historical occurrence number is greater than or equal to the preset number.
Optionally, the determining module 81 is further configured to: under the condition that the historical occurrence frequency of the target hash value is 0, determining whether a storage unit corresponding to the target hash value in a second preset storage area is full;
the apparatus 80 further comprises: a recording module 87, configured to record the target hash value and the updated historical occurrence number of the target hash value in the storage unit corresponding to the target hash value in the second preset storage area when the storage unit corresponding to the target hash value in the second preset storage area is not full.
Optionally, the apparatus 80 further comprises: a deleting module 86, configured to delete, when a storage unit corresponding to the target hash value in the second preset storage area is full, record information corresponding to one or more hash values with a smallest historical occurrence number in the storage unit corresponding to the target hash value in the second preset storage area; the recording module 87 is further configured to record the target hash value and the updated historical occurrence number of the target hash value in a storage unit corresponding to the target hash value in the second preset storage area.
Optionally, the second preset storage area is configured to store the hash value with the historical occurrence number smaller than the preset number, and the historical occurrence number of the hash value with the historical occurrence number smaller than the preset number.
The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
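The contents of the two preset storage areas described above can be summarised in a short Python sketch. The record types `IndexRecord` and `SampleRecord`, the helper names, and the rule for choosing a storage unit (first 8 bytes of the hash modulo the number of units) are assumptions for illustration and are not taken from this section.

```python
import hashlib
from typing import NamedTuple


class IndexRecord(NamedTuple):
    """Record in the first preset storage area: hash value and actual storage location."""
    hash_value: bytes
    location: int


class SampleRecord(NamedTuple):
    """Record in the second preset storage area: hash value and historical occurrence number."""
    hash_value: bytes
    count: int


def unit_number(hash_value: bytes, unit_count: int) -> int:
    """Pick the storage unit for a hash (assumed rule: first 8 bytes modulo unit count)."""
    return int.from_bytes(hash_value[:8], "big") % unit_count


def area_for(count: int, preset_count: int) -> str:
    """Hashes seen at least `preset_count` times belong to the first area, others to the second."""
    return "first" if count >= preset_count else "second"


h = hashlib.sha256(b"example block").digest()
print(unit_number(h, 16), area_for(3, 3), area_for(2, 3))
```

Keeping only a count in the second area and only a location in the first area means each hash value occupies a record in at most one area at a time, which is consistent with deleting the sampling-area record when a hash is moved to the first area.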
Fig. 9 shows a block diagram of a data deduplication processing apparatus according to an embodiment of the present application. In this embodiment of the present application, the data deduplication processing apparatus may specifically be the hash indexing device in the above embodiment. As shown in fig. 9, the data deduplication processing apparatus includes: a memory 910 and a processor 920, the memory 910 having stored therein computer programs operable on the processor 920. The processor 920 implements the method of data deduplication processing in the above-described embodiment when executing the computer program. The number of the memory 910 and the processor 920 may be one or more.
The data deduplication processing apparatus further includes:
a communication interface 930 for communicating with an external device to perform data interactive transmission.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, the memory 910, the processor 920 and the communication interface 930 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.
Embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and execute the instruction stored in the memory from the memory, so that the communication device in which the chip is installed executes the method provided in the embodiment of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be an advanced reduced instruction set machine (ARM) architecture supported processor.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may include a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, for example, Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the method of the above embodiments may be implemented by hardware that is configured to be instructed to perform the relevant steps by a program, which may be stored in a computer-readable storage medium, and which, when executed, includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

1. A data deduplication processing method, the method comprising:
receiving a position to be stored of a target data block in a storage system, wherein the position to be stored is sent by preset equipment;
determining whether the historical occurrence frequency of the target hash value is greater than or equal to a preset frequency or not according to the target hash value of the target data block;
determining the target data block as a repeated data block under the condition that the historical occurrence frequency of the target hash value is greater than or equal to a preset frequency;
feeding back the actual storage position of the target data block in the storage system to a preset device;
under the condition that the historical occurrence frequency of the target hash value is smaller than a preset frequency, updating the historical occurrence frequency of the target hash value to obtain the updated historical occurrence frequency of the target hash value;
determining whether a storage unit corresponding to the target hash value in a first preset storage area is full or not under the condition that the historical occurrence frequency of the updated target hash value is greater than or equal to a preset frequency;
when the storage unit corresponding to the target hash value in a first preset storage area is full, returning the position to be stored to the preset equipment to enable the preset equipment to rewrite a target data block in the position to be stored;
under the condition that the historical occurrence frequency of the target hash value is 0, determining whether a storage unit corresponding to the target hash value in a second preset storage area is full;
under the condition that a storage unit corresponding to the target hash value in a second preset storage area is full, deleting the record information corresponding to one or more hash values with the smallest historical occurrence frequency in the storage unit corresponding to the target hash value in the second preset storage area;
and recording the target hash value and the updated historical occurrence times of the target hash value in a corresponding storage unit of the target hash value in a second preset storage area.
2. The method according to claim 1, wherein the preset device is configured to not write the target data block on the to-be-stored location of the storage system if the to-be-stored location and the actual storage location are different.
3. The method according to claim 1, wherein determining whether the storage unit corresponding to the target hash value in the first preset storage area is full when the updated historical occurrence number of the target hash value is greater than or equal to a preset number, further comprises:
and under the condition that the storage unit corresponding to the target hash value in the first preset storage area is not full, storing the target hash value and the position to be stored into the storage unit corresponding to the target hash value in the first preset storage area.
4. The method according to claim 1, wherein in a case that a storage unit of the target hash value corresponding to the first preset storage area is not full, storing the target hash value and the to-be-stored location in the storage unit of the target hash value corresponding to the first preset storage area comprises:
and under the condition that the storage unit corresponding to the target hash value in the first preset storage area is not full, storing the target hash value and the position to be stored into the storage unit corresponding to the target hash value in the first preset storage area, and deleting the record information of the target hash value in the storage unit corresponding to the second preset storage area.
5. The method of claim 1, further comprising:
and returning the position to be stored to the preset equipment under the condition that the updated historical occurrence times of the target hash value are less than preset times.
6. The method according to any one of claims 1 to 5, wherein the first preset storage area is used for storing the hash value with the historical occurrence number greater than or equal to a preset number, and the actual storage position of the data block in the storage system corresponding to the hash value with the historical occurrence number greater than or equal to the preset number.
7. The method of claim 1, further comprising:
and under the condition that the storage unit corresponding to the target hash value in a second preset storage area is not full, recording the target hash value and the updated historical occurrence times of the target hash value in the storage unit corresponding to the target hash value in the second preset storage area.
8. The method according to claim 1, wherein the second preset storage area is used for storing the hash value with the historical occurrence number smaller than the preset number and the historical occurrence number of the hash value with the historical occurrence number smaller than the preset number.
9. A data deduplication processing apparatus, the apparatus comprising:
the receiving module is used for receiving a position to be stored of a target data block in the storage system, wherein the position is sent by preset equipment;
the determining module is used for determining whether the historical occurrence times of the target hash value is greater than or equal to the preset times or not according to the target hash value of the target data block; determining the target data block as a repeated data block under the condition that the historical occurrence frequency of the target hash value is greater than or equal to a preset frequency;
the feedback module is used for feeding back the actual storage position of the target data block in the storage system to preset equipment;
the updating module is used for updating the historical occurrence frequency of the target hash value under the condition that the historical occurrence frequency of the target hash value is smaller than a preset frequency to obtain the updated historical occurrence frequency of the target hash value;
the determination module is further to: determining whether a storage unit corresponding to the target hash value in a first preset storage area is full or not under the condition that the historical occurrence frequency of the updated target hash value is greater than or equal to a preset frequency;
the feedback module is further configured to: when the storage unit corresponding to the target hash value in a first preset storage area is full, returning the position to be stored to the preset equipment to enable the preset equipment to rewrite a target data block in the position to be stored;
the determination module is further to: under the condition that the historical occurrence frequency of the target hash value is 0, determining whether a storage unit corresponding to the target hash value in a second preset storage area is full;
the deleting module is used for deleting the record information corresponding to one or more hash values with the smallest historical occurrence frequency in the storage unit corresponding to the target hash value in the second preset storage area under the condition that the storage unit corresponding to the target hash value in the second preset storage area is full;
and the recording module is used for recording the target hash value and the updated historical occurrence times of the target hash value in a corresponding storage unit of the target hash value in a second preset storage area.
10. The apparatus according to claim 9, wherein the preset device is configured to not write the target data block on the to-be-stored location of the storage system if the to-be-stored location and the actual storage location are different.
11. The apparatus of claim 9, wherein:
the device further comprises: a storage module;
the storage module is used for storing the target hash value and the position to be stored into the storage unit corresponding to the target hash value in the first preset storage area under the condition that the storage unit corresponding to the target hash value in the first preset storage area is not full;
the deleting module is further configured to, under the condition that a storage unit corresponding to the target hash value in a first preset storage area is not full, delete the record information of the target hash value in a storage unit corresponding to a second preset storage area after the storage module stores the target hash value and the to-be-stored position in the storage unit corresponding to the target hash value in the first preset storage area.
12. The apparatus of claim 9, wherein the feedback module is further configured to: and returning the position to be stored to the preset equipment under the condition that the updated historical occurrence times of the target hash value are less than preset times.
13. The apparatus according to any one of claims 9 to 12, wherein the first preset storage area is configured to store hash values with historical occurrence numbers greater than or equal to preset numbers, and actual storage locations of data blocks in the storage system corresponding to the hash values with the historical occurrence numbers greater than or equal to the preset numbers.
14. The apparatus according to claim 9, wherein the recording module is further configured to record the target hash value and the updated historical occurrence number of the target hash value in the storage unit of the target hash value corresponding to the second preset storage area, if the storage unit of the target hash value corresponding to the second preset storage area is not full.
15. The apparatus according to claim 9, wherein the second preset storage area is configured to store the hash value whose historical occurrence number is smaller than a preset number, and the historical occurrence number of the hash value whose historical occurrence number is smaller than the preset number.
16. A data deduplication processing apparatus, comprising: a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of claims 1 to 8.
17. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
Application number: CN202011439014.5A
Title: Data deduplication processing method, device, equipment and storage medium
Priority date: 2020-12-11
Filing date: 2020-12-11
Status: Active
Publications: CN112559452A (en), published 2021-03-26; CN112559452B (en), published 2021-12-17
Family ID: 75060546
Country: CN, CN112559452B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113485949B (en) * 2021-05-28 2023-06-20 浙江毫微米科技有限公司 Data processing method, chip and computer readable storage medium
CN113992612A (en) * 2021-09-15 2022-01-28 上海绚显科技有限公司 Message processing method and device, electronic equipment and storage medium
CN114064621B (en) * 2021-10-28 2022-07-15 江苏未至科技股份有限公司 Method for judging repeated data
CN114691430A (en) * 2022-04-24 2022-07-01 北京科技大学 Incremental backup method and system for CAD (computer-aided design) engineering data files
CN116010362A (en) * 2023-03-29 2023-04-25 世优(北京)科技有限公司 File storage and file reading method, device and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760101A (en) * 2012-05-22 2012-10-31 中国科学院计算技术研究所 SSD-based (Solid State Disk) cache management method and system
CN107632789A (en) * 2017-09-29 2018-01-26 郑州云海信息技术有限公司 Deduplication method and system based on distributed storage, and data duplication detection method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8572053B2 (en) * 2010-12-09 2013-10-29 Jeffrey Vincent TOFANO De-duplication indexing
CN103870514B (en) * 2012-12-18 2018-03-09 华为技术有限公司 Data de-duplication method and device
CN103309975B (en) * 2013-06-09 2017-04-26 华为技术有限公司 Duplicated data deleting method and apparatus
CN106610790B (en) * 2015-10-26 2020-01-03 华为技术有限公司 Method and device for deleting repeated data
CN105787037B (en) * 2016-02-25 2019-03-15 浪潮(北京)电子信息产业有限公司 Method and device for deleting duplicate data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Data deduplication technology for mobile flash memory; He Qinlu; Journal of Xidian University; 20200229; Vol. 47 (No. 1); pp. 128-134 *

Also Published As

Publication number Publication date
CN112559452A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN112559452B (en) Data deduplication processing method, device, equipment and storage medium
US9880746B1 (en) Method to increase random I/O performance with low memory overheads
CN107787489B (en) File storage system including a hierarchy
US10102150B1 (en) Adaptive smart data cache eviction
US9740403B2 (en) Methods for managing storage in a data storage cluster with distributed zones based on parity values and devices thereof
US9396073B2 (en) Optimizing restores of deduplicated data
US9715434B1 (en) System and method for estimating storage space needed to store data migrated from a source storage to a target storage
US8943032B1 (en) System and method for data migration using hybrid modes
US8949208B1 (en) System and method for bulk data movement between storage tiers
US9390116B1 (en) Insertion and eviction schemes for deduplicated cache system of a storage system
US8639669B1 (en) Method and apparatus for determining optimal chunk sizes of a deduplicated storage system
US20220236925A1 (en) Data structure storage and data management
US9928248B2 (en) Self-healing by hash-based deduplication
US10990310B2 (en) Sub-block data deduplication
US10936228B2 (en) Providing data deduplication in a data storage system with parallelized computation of crypto-digests for blocks of host I/O data
CN112463077B (en) Data block processing method, device, equipment and storage medium
US11347725B2 (en) Efficient handling of highly amortized metadata page updates in storage clusters with delta log-based architectures
US20190324916A1 (en) Compression of Host I/O Data in a Storage Processor of a Data Storage System with Selection of Data Compression Components Based on a Current Fullness Level of a Persistent Cache
US9959049B1 (en) Aggregated background processing in a data storage system to improve system resource utilization
WO2017020576A1 (en) Method and apparatus for file compaction in key-value storage system
WO2014108818A1 (en) Real-time classification of data into data compression domains
US10430273B2 (en) Cache based recovery of corrupted or missing data
US11449480B2 (en) Similarity hash for binary data pages
US10474587B1 (en) Smart weighted container data cache eviction
US10996853B2 (en) Deduplicated data block ownership determination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant