CN111124312A - Data deduplication method and device - Google Patents

Data deduplication method and device

Info

Publication number
CN111124312A
Authority
CN
China
Prior art keywords
hash value
input data
current input
nth
stored
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911338293.3A
Other languages
Chinese (zh)
Other versions
CN111124312B (en)
Inventor
李嘉树
季成
卢冕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN201911338293.3A priority Critical patent/CN111124312B/en
Publication of CN111124312A publication Critical patent/CN111124312A/en
Application granted granted Critical
Publication of CN111124312B publication Critical patent/CN111124312B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0611Improving I/O performance in relation to response time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks
    • G06F3/0641De-duplication techniques
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data deduplication method and an apparatus therefor are provided. The method performs the following operations on input data in a pipelined manner, according to a preset pipeline gap of an FPGA chip on an FPGA board: performing hash calculation on the current input data with the FPGA chip to obtain an addressing hash value and a storage hash value of the current input data; and comparing, with the FPGA chip, the storage hash value of the current input data against all storage hash values stored within a predetermined step size of the addressing hash value in the nth-level memory among N levels of memory, to determine whether the current input data is duplicate data. The first M levels of the N levels of memory belong to internal memory within the FPGA chip, and the last N−M levels belong to external memory on the FPGA board outside the FPGA chip, where N, M, and n are positive integers with N ≥ 2, 1 ≤ M ≤ N−1, and 1 ≤ n ≤ N.

Description

Data deduplication method and device
Technical Field
The present application relates generally to the field of data deduplication, and more particularly, to a method of data deduplication and an apparatus therefor.
Background
With the rapid development of big-data technology, information systems throughout society generate ever more data. Because data sets keep growing, many classical algorithms can no longer be applied efficiently owing to their computation time. At the same time, modern data sets contain more and more fully or partially duplicated data. If data can be deduplicated quickly and effectively in the preprocessing stage, those skilled in the art can merge identical or similar data items, greatly reducing the running time of downstream algorithms.
At present, deduplicating data with a hash table is common practice in computer engineering. Because hash-based deduplication has an average time complexity of O(n), it tends to achieve higher throughput on typical data than sort-based deduplication. However, owing to the many constraints on hash computation, parallelization, and cache control, the hash deduplication algorithm often performs poorly on a central processing unit, so deduplication itself frequently becomes the bottleneck of a computation task.
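The average O(n) behavior of hash-based deduplication mentioned above can be illustrated with a minimal software sketch (an illustrative example only, not the FPGA method of the present application): each item costs one hash-table membership test and one insert.

```python
def dedup_hash(items):
    """Average O(n): one hash-table membership test and insert per item."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)   # keep the first occurrence, drop later duplicates
    return out

assert dedup_hash([3, 1, 3, 2, 1]) == [3, 1, 2]
```

A sort-based approach would first order the items in O(n log n) time and then drop adjacent equals, which is why the hash table tends to win on throughput for typical data.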
In addition, thanks to pipeline optimization and parallel computation, implementing the hash deduplication algorithm on an FPGA is also a fast and efficient engineering choice. Pipeline optimization is a parallel optimization method commonly used in hardware acceleration: it divides a complex processing operation into multiple steps and overlaps operations at different steps, allowing multiple operations to proceed in parallel so that up to one valid output is produced per clock cycle. Furthermore, the large amount of programmable computing resources on an FPGA allows multiple groups of processing engines to compute in parallel, further improving the performance of the hash deduplication algorithm on the FPGA.
However, because of the linked-list storage structure used by hash algorithms, the limited internal memory of the FPGA chip, and the access-pattern restrictions of external memory (DDR, etc.) outside the FPGA chip, the pipeline of a hash deduplication algorithm on an FPGA chip often stalls on wait operations. Since a pipelined system produces no valid output during a stall, such stalls can degrade system performance by several times or even tens of times.
Disclosure of Invention
An exemplary embodiment of the present invention is to provide a data deduplication method and an apparatus thereof to solve at least the above problems of the prior art.
According to an exemplary embodiment of the present invention, a method for data deduplication may include performing the following operations on input data in a pipelined manner according to a preset pipeline gap of an FPGA chip on an FPGA board: performing hash calculation on the current input data with the FPGA chip to obtain an addressing hash value and a storage hash value of the current input data; and comparing, with the FPGA chip, the storage hash value of the current input data against all storage hash values stored within a predetermined step size of the addressing hash value in the nth-level memory among N levels of memory, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to internal memory within the FPGA chip, the last N−M levels belong to external memory on the FPGA board outside the FPGA chip, and N, M, and n are positive integers with N ≥ 2, 1 ≤ M ≤ N−1, and 1 ≤ n ≤ N.
Optionally, the step of comparing, with the FPGA chip, the stored hash value of the current input data with all stored hash values stored within a predetermined step size of the addressed hash value in the nth-level memory to determine whether the current input data is duplicate data may include, starting from the 1st-level memory (n = 1): performing a first comparison of the stored hash value of the current input data with all stored hash values stored within the predetermined step size of the addressed hash value in the nth-level memory; determining that the current input data is duplicate data if the stored hash value of the current input data is the same as any of those stored hash values; and, if the stored hash value of the current input data differs from every one of those stored hash values, setting n = n + 1 and performing the first comparison again.
Optionally, the step of performing the first comparison may include comparing the stored hash value of the current input data simultaneously with all stored hash values within the predetermined step size of the addressed hash value in the nth-level memory, according to an open addressing scheme.
Optionally, when the nth-level memory is one of the first N−1 levels of memory, if the stored hash value of the current input data differs from every stored hash value stored within the predetermined step size of the addressed hash value in the nth-level memory, one of those stored hash values is replaced with the stored hash value of the current input data, and the first comparison is performed again for the (n+1)th-level memory.
Optionally, when the nth-level memory is the Nth-level memory (n = N), the current input data is determined not to be duplicate data if its stored hash value differs from every stored hash value stored within the predetermined step size of the addressed hash value in the Nth-level memory.
Optionally, when the current input data is determined not to be duplicate data according to the Nth-level memory, the method may further include: determining whether an empty entry exists within the predetermined step size in the Nth-level memory; if an empty entry exists, storing the stored hash value of the current input data in that entry; and if no empty entry exists, discarding the stored hash value of the current input data.
Optionally, the preset pipeline gap may be adjusted according to the accuracy requirement of data deduplication and the memory read/write latency.
According to an exemplary embodiment of the present invention, there is provided an apparatus for data deduplication, which may include an FPGA board, wherein the FPGA board includes an FPGA chip and at least one external memory, and the FPGA chip includes at least one compute engine. Each compute engine includes at least one hash calculator, at least one internal memory, and a controller, and is configured to process input data in a pipelined manner according to a preset pipeline gap of the FPGA chip: performing hash calculation on the current input data by the hash calculator in the compute engine to obtain an addressing hash value and a storage hash value of the current input data; and comparing, by the controller in the compute engine, the storage hash value of the current input data with all storage hash values stored within a predetermined step size of the addressing hash value in the nth-level memory among N levels of memory to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to the internal memory in the compute engine, the last N−M levels belong to the at least one external memory, and N, M, and n are positive integers with N ≥ 2, 1 ≤ M ≤ N−1, and 1 ≤ n ≤ N.
Optionally, each compute engine may be configured to determine whether the current input data is duplicate data by, starting from the 1st-level memory (n = 1): performing a first comparison of the stored hash value of the current input data with all stored hash values stored within the predetermined step size of the addressed hash value in the nth-level memory; determining that the current input data is duplicate data if the stored hash value of the current input data is the same as any of those stored hash values; and, if the stored hash value of the current input data differs from every one of those stored hash values, setting n = n + 1 and performing the first comparison again.
Optionally, each compute engine may be configured to perform the first comparison by comparing the stored hash value of the current input data simultaneously with all stored hash values within the predetermined step size of the addressed hash value in the nth-level memory, according to an open addressing scheme.
Optionally, when the nth-level memory is one of the first N−1 levels of memory, if the stored hash value of the current input data differs from every stored hash value stored within the predetermined step size of the addressed hash value in the nth-level memory, each compute engine is configured to replace one of those stored hash values with the stored hash value of the current input data, and to perform the first comparison again for the (n+1)th-level memory.
Optionally, when the nth-level memory is the Nth-level memory (n = N), if the stored hash value of the current input data differs from every stored hash value stored within the predetermined step size of the addressed hash value in the Nth-level memory, each compute engine is configured to determine that the current input data is not duplicate data.
Optionally, when determining from the Nth-level memory that the current input data is not duplicate data, each compute engine may be further configured to: determine whether an empty entry exists within the predetermined step size in the Nth-level memory; if an empty entry exists, store the stored hash value of the current input data in that entry; and if no empty entry exists, discard the stored hash value of the current input data.
Optionally, the preset pipeline gap can be adjusted according to the accuracy requirement of data deduplication and the memory read/write latency.
The data deduplication method and apparatus according to the exemplary embodiments of the present application can achieve fast preliminary deduplication of large amounts of data.
Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.
Drawings
These and/or other aspects and advantages of the present application will become more apparent and more readily appreciated from the following detailed description of the embodiments of the present application, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a general flowchart illustrating a data deduplication method according to an exemplary embodiment of the present application;
FIG. 2 is an example of an FPGA board according to an exemplary embodiment of the present application;
FIG. 3 is a detailed flowchart illustrating a data deduplication method according to an exemplary embodiment of the present application;
FIG. 4 is a diagram illustrating an open addressing scheme with fixed step sizes according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram illustrating pipeline gap adjustment;
FIG. 6 is a block diagram illustrating a structure of a data deduplication apparatus according to an exemplary embodiment of the present application;
FIG. 7 is a block diagram illustrating a structure of a compute engine according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments are described below in order to explain the present invention by referring to the figures.
Fig. 1 is a flowchart illustrating a data deduplication method according to an exemplary embodiment of the present application. To improve deduplication efficiency, the present invention performs the deduplication operations on input data in a pipelined manner according to a preset pipeline gap of the FPGA chip on the FPGA board. Accordingly, when deduplicating input data, the data deduplication method performs the following steps S100 and S200 on the input data in that pipelined manner.
In step S100, the FPGA chip performs hash calculation on the current input data to obtain an addressing hash value and a storage hash value of the current input data.
Specifically, a conventional hash deduplication algorithm stores complete data in a hash table as key values, and those key values are often stored as linked lists. Because the length of a linked list is uncertain, this storage scheme tends to increase the complexity of a pipelined system or to force the pipeline to stall; that is, it introduces considerable timing uncertainty into the addressing operation. In addition, for many common data sets, data items tend to be long or of variable length, and because the hash deduplication algorithm performs a large number of comparison operations, the key-value length also affects the throughput of the deduplication system. For images, for example, a conventional hash deduplication algorithm stores the complete image data in the hash table as the key value, so the key values are long and vary with the amount of image data in each image. To solve these problems, the data deduplication method of the present application adopts a new storage structure for hash-table key values: instead of storing the complete data (e.g., image data, audio data, etc.) directly as the key value, two hash values of the current input data are calculated, namely an addressing hash value and a storage hash value. The addressing hash value is used to address the memory, i.e., to locate the key value stored in the memory; the storage hash value is stored in the memory as the key value and is used in the data comparison operations for the current input data. This effectively reduces the cost of comparison operations and saves a large amount of computing and storage resources.
In addition, when calculating the addressing hash value and the storage hash value of the current input data, the data deduplication method uses two different hash algorithms, one for each value.
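As a rough software sketch of this two-hash-value scheme (the specific hash functions, truncated MD5 and SHA-1, and the table size are assumptions made for illustration, not choices specified by the application):

```python
import hashlib

TABLE_SIZE = 1024  # hypothetical hash-table size

def addressing_hash(data: bytes, table_size: int = TABLE_SIZE) -> int:
    """Used only to locate a slot in the hash table (truncated MD5 here)."""
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big") % table_size

def storage_hash(data: bytes) -> int:
    """Fixed-width fingerprint stored as the key value and used for
    comparisons (truncated SHA-1 here), instead of the full data item."""
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "big")

item = b"raw image or audio bytes"     # placeholder for a real data item
slot = addressing_hash(item)           # where in the table to look
key = storage_hash(item)               # what to compare and store
```

Because only a fixed-width fingerprint is compared rather than the full item, each comparison costs the same regardless of how long the original data item is; the trade-off is a small probability of fingerprint collisions, which is why the method is described as a fast preliminary deduplication.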
In step S200, the stored hash value of the current input data is compared with all stored hash values stored within a predetermined step size of the addressed hash value in the nth-level memory among N levels of memory, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to internal memory (e.g., a cache) within the FPGA chip, the last N−M levels belong to external memory (e.g., DDR, etc.) on the FPGA board outside the FPGA chip, and N, M, and n are positive integers with N ≥ 2, 1 ≤ M ≤ N−1, and 1 ≤ n ≤ N.
For example, as shown in fig. 2, the FPGA board includes an FPGA chip and 3 external memories; the FPGA chip includes 2 compute engines, and each compute engine may include 4 hash calculators, 3 internal memories, and 1 controller. All memories on the FPGA board are divided into 4 levels: the internal memory of each compute engine forms the 1st-level and 2nd-level memories, and the external memory outside the FPGA chip forms the 3rd-level and 4th-level memories. In other words, in the example in fig. 2, N is 4 and M is 2, but this is only an example, and the present invention is not limited thereto. The internal memory of the FPGA chip has a small capacity but very fast read and write speeds, so it can accelerate the lookup of duplicate data and improve deduplication accuracy. By mixing multiple levels of different memory types inside and outside the FPGA chip, the present application overcomes both the problem that the internal memory of the FPGA chip, though fast, is too small to handle a large amount of data, and the problem that the external memory (such as DDR), though large, is slow enough to open a huge gap in the pipeline of the hash algorithm.
The data deduplication method illustrated in fig. 1 will be described in detail below with reference to fig. 3, and fig. 3 is a detailed flowchart illustrating the data deduplication method according to an exemplary embodiment of the present application.
As shown in fig. 3, in step S100, the FPGA chip performs a hash calculation on the current input data to obtain an addressing hash value and a storage hash value of the current input data. Since this has been described in detail above, it will not be described in detail here.
In step S201, n is set to 1; that is, the subsequent operations start from the 1st-level memory among the N levels of memory on the FPGA board.
In step S202, it is determined whether n ≤ N, that is, whether the nth-level memory is still one of the N levels of memory.
In step S203, a first comparison is performed: the stored hash value of the current input data is compared with all stored hash values stored within a predetermined step size of the addressed hash value in the nth-level memory. In other words, the first comparison determines whether the current input data is duplicate data.
Specifically, the predetermined step size may differ between the internal memory of the FPGA chip and the external memory on the board. The predetermined step size may be set to the same value for every one of the N memory levels, or to different values per level; for example, the step sizes set for the first N−1 levels may be smaller than the step size set for the Nth level, e.g., 1 for the first N−1 levels and a positive integer greater than 1 (e.g., 3, 4, or 5) for the Nth level. This is merely an example, and the present invention is not limited thereto.
The step of performing the first comparison may include comparing the stored hash value of the current input data simultaneously with all stored hash values within the predetermined step size of the addressed hash value in the nth-level memory, according to an open addressing scheme.
Specifically, the method adopts an open addressing scheme with a fixed step size. Such a scheme exploits the parallel-computing strength of the FPGA chip by comparing all elements within the step-size range simultaneously, and the fixed step size makes the addressing time stable, which greatly simplifies the design of the pipeline in the algorithm, effectively avoids pipeline stalls within a controllable accuracy range, and effectively resolves hash collisions. For example, fig. 4 illustrates comparing the stored hash value of the current input data simultaneously with all stored hash values within the predetermined step size of the addressed hash value in the nth-level memory according to an open addressing scheme, in an exemplary embodiment of the present application. In the example shown in fig. 4, the predetermined step size set for the nth-level memory is 4. When performing the first comparison, the addressing hash value is first used to locate a key value in the nth-level memory; for example, addressing hash value 1 in fig. 4 locates the position of the key value "store hash value 1". Then, according to the step size of 4, the 4 stored hash values within the predetermined step size of that position are acquired: "store hash value 1", "empty", "store hash value 2", and "store hash value 3". Finally, the stored hash value of the current input data is compared with the 4 acquired stored hash values simultaneously. The process shown in fig. 4 is merely exemplary, and the present invention is not limited thereto.
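A software sketch of the fixed-step open-addressing comparison might look as follows; on the FPGA the comparisons within the window occur in a single cycle, while this Python model performs them sequentially, and the table contents mirror the fig. 4 example with hypothetical numeric values:

```python
def first_comparison(table, addr, key, step):
    """Compare `key` with every entry within `step` slots of `addr`.
    On the FPGA these comparisons run in parallel in one cycle; here
    they are sequential. Returns True if the key is a duplicate."""
    window = [table[(addr + i) % len(table)] for i in range(step)]
    return any(entry == key for entry in window)

# Mirror of the fig. 4 layout: slot 1 holds "store hash value 1",
# slot 2 is empty, slots 3 and 4 hold "store hash value 2" and
# "store hash value 3" (values 101, 202, 303 are made up).
table = [None] * 8
table[1], table[3], table[4] = 101, 202, 303
assert first_comparison(table, 1, 202, step=4)       # match inside window
assert not first_comparison(table, 1, 999, step=4)   # no match in window
```

Because the window length is fixed, the lookup takes the same time for every input, which is what keeps the pipeline free of data-dependent stalls.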
Referring back to fig. 3, if the stored hash value of the current input data is the same as any of the stored hash values stored within the predetermined step size of the addressed hash value in the nth-level memory, the current input data is determined to be duplicate data in step S204. In that case, the data deduplication method directly discards the current input data and then performs step S100 for the next input data.
If the stored hash value of the current input data differs from every stored hash value stored within the predetermined step size of the addressed hash value in the nth-level memory, n is set to n + 1 in step S205 so that the first comparison is performed again.
Specifically, if the stored hash value of the current input data is determined to differ from every stored hash value stored within the predetermined step size of the addressed hash value in the nth-level memory, the data deduplication method performs, for the (n+1)th-level memory, an operation similar to the one already performed for the nth-level memory. As shown in fig. 3, after step S205 the method returns to step S202 to determine whether n ≤ N, that is, whether the operations for all N levels of memory have been completed.
In other words, when the nth-level memory is one of the first N−1 levels of memory (that is, when n < N), if the stored hash value of the current input data differs from every stored hash value stored within the predetermined step size of the addressed hash value in the nth-level memory, one of those stored hash values is replaced with the stored hash value of the current input data, and the first comparison is performed again for the (n+1)th-level memory. The stored hash value to be replaced may be selected randomly or according to a predetermined replacement rule; for example, if the predetermined step size of the nth-level memory is 1, the stored hash value located by the addressing hash value of the current input data in the nth-level memory may be replaced directly with the stored hash value of the current input data.
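The escalation through the first N−1 memory levels, with replacement on a miss, might be modeled as below; the function names, the random replacement policy, and the addressing callback are illustrative assumptions rather than details fixed by the application:

```python
import random

def lookup_front_levels(levels, steps, addr_of, key):
    """Walk memory levels 1..N-1 (hypothetical model). On a miss at a
    level, cache the key there by replacing one entry inside the step
    window, then escalate to the next level; return True on a match."""
    for table, step in zip(levels, steps):
        addr = addr_of(len(table))
        window = [(addr + i) % len(table) for i in range(step)]
        if any(table[i] == key for i in window):
            return True                      # duplicate found at this level
        table[random.choice(window)] = key   # replace one entry, escalate
    return False                             # miss in all front levels
```

Caching the key in the faster front levels on the way down means a later occurrence of the same item can be caught without ever reaching the slow external memory.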
However, when the nth-level memory is the Nth-level memory (that is, when n = N), if the stored hash value of the current input data differs from every stored hash value stored within the predetermined step size of the addressed hash value in the Nth-level memory, the current input data is determined not to be duplicate data and may be output. In this case, the data deduplication method further includes: determining whether an empty entry exists within the predetermined step size of the addressing hash value in the Nth-level memory; if an empty entry exists, storing the stored hash value of the current input data in that entry; and if no empty entry exists, discarding the stored hash value of the current input data. The Nth-level memory is an external memory located outside the FPGA chip on the FPGA board, and the final hash table is stored in the Nth-level memory.
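The behavior at the Nth level (store the key in an empty entry inside the window if one exists, otherwise discard it) might be sketched as follows; this is an illustrative model, not the on-chip implementation:

```python
def final_level(table, addr, key, step):
    """Nth-level lookup (hypothetical model). A miss means the input is
    not a duplicate; the key is then stored in an empty entry inside the
    step window if one exists, otherwise it is discarded."""
    window = [(addr + i) % len(table) for i in range(step)]
    if any(table[i] == key for i in window):
        return True                  # duplicate data
    for i in window:
        if table[i] is None:
            table[i] = key           # remember for future comparisons
            break
    # if no empty entry existed, the key was simply discarded
    return False                     # not a duplicate; the input is output
```

Discarding the key when the window is full bounds every operation to a fixed cost, at the price of possibly missing a later duplicate of that item, which is consistent with the method's role as fast preliminary deduplication.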
In addition, as shown in fig. 5, if the same memory address is read and written by different operations, the read-write pattern in fig. 5 may introduce data hazards that affect deduplication accuracy. The pipeline gap preset for the deduplication operation can therefore be adjusted according to the accuracy requirement of data deduplication and the memory read/write latency, thereby controlling how often data collisions occur and balancing deduplication accuracy against pipeline performance.
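The data hazard that motivates the pipeline gap can be illustrated with a toy timing model; the parameter names and latency values are assumptions for illustration, not figures from the application. A write issued at cycle t only becomes visible at cycle t + write_latency, so two identical items issued closer together than the write latency produce a missed duplicate:

```python
def simulate(items, gap, write_latency):
    """Toy model: a hash-table write issued at cycle t becomes visible
    at t + write_latency; successive items enter `gap` cycles apart.
    Returns the number of duplicates actually detected."""
    visible, pending, detected, t = set(), [], 0, 0
    for x in items:
        pending, ready = ([(v, tc) for v, tc in pending if tc > t],
                          [v for v, tc in pending if tc <= t])
        visible.update(ready)        # commit writes whose latency elapsed
        if x in visible:
            detected += 1
        else:
            pending.append((x, t + write_latency))
        t += gap
    return detected

assert simulate([7, 7], gap=1, write_latency=4) == 0  # hazard: miss
assert simulate([7, 7], gap=4, write_latency=4) == 1  # gap covers latency
```

A larger gap lowers throughput but makes earlier writes visible before the next lookup, which is exactly the accuracy-versus-performance trade the text describes.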
Fig. 6 is a block diagram illustrating a structure of a data deduplication apparatus 100 according to an exemplary embodiment of the present application.
As shown in fig. 6, the data deduplication device 100 may include an FPGA card 110, wherein the FPGA card 110 may include an FPGA chip 111 and at least one external memory 112. Further, the FPGA chip 111 includes at least one compute engine 1110, wherein, as shown in fig. 7, each compute engine 1110 includes a controller 1111, at least one hash calculator 1112, and at least one internal memory 1113.
Each compute engine 1110 may perform deduplication on the current input data through a finite state machine, with pipeline optimization applied to the entire deduplication operation. That is, each compute engine 1110 is configured to perform deduplication on input data in a pipelined manner according to a preset pipeline gap of the FPGA chip 111, and each compute engine 1110 can, on average, process a new piece of input data every clock cycle.
Specifically, each compute engine 1110 is configured to pipeline input data according to a preset pipeline gap of the FPGA chip 111 by: performing a hash calculation on the current input data by the hash calculator 1112 in each compute engine 1110 to obtain an addressing hash value and a storage hash value of the current input data; and comparing, by the controller 1111 of each compute engine 1110, the stored hash value of the current input data with all stored hash values stored in the nth-level memory of the N levels of memory within a predetermined step size from the addressed hash value, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to the internal memory 1113 of each compute engine 1110, the last N-M levels belong to the at least one external memory 112, N, M and n are positive integers, and N ≥ 2, 1 ≤ M ≤ N-1, 1 ≤ n ≤ N.
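The two hash values described above can be derived from a single hash computation. The following is a minimal software sketch of that idea; the 64-bit hash function, the field widths, and the bit-split are illustrative assumptions, not the implementation mandated by the patent:

```python
import hashlib

def hash_pair(record: bytes, addr_bits: int = 20, tag_bits: int = 16):
    """Derive an addressing hash (table index) and a storage hash
    (the short value stored as the key) from one hash computation."""
    h = int.from_bytes(hashlib.blake2b(record, digest_size=8).digest(), "big")
    addressing_hash = h & ((1 << addr_bits) - 1)             # low bits: slot index
    storage_hash = (h >> addr_bits) & ((1 << tag_bits) - 1)  # next bits: stored tag
    return addressing_hash, storage_hash

addr, tag = hash_pair(b"example record")
assert 0 <= addr < 1 << 20 and 0 <= tag < 1 << 16
```

Storing only the short storage hash, rather than the full record, is what reduces both the memory footprint of the hash table and the width of each comparison.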
How the calculation engine 1110 performs data deduplication will be described in detail below.
Specifically, each compute engine 1110 may be configured to determine whether the current input data is duplicate data starting from the 1st-level memory of the N levels of memory: first, the value of n is set to 1 (that is, n = 1), so that the subsequent operations are performed starting from the 1st-level memory on the FPGA card 110; then, it is judged whether n is less than or equal to N, that is, whether the nth-level memory is one of the N levels of memory; thereafter, a first comparison is performed, comparing the stored hash value of the current input data with all stored hash values stored in the nth-level memory within the predetermined step size from the addressed hash value; in other words, the first comparison is performed to determine whether the current input data is duplicated.
The predetermined step size may be set to different values for the internal memory 1113 of the FPGA chip 111 and the external memory 112 located on the FPGA card 110 outside the FPGA chip 111. For example, the predetermined step size may be set to the same value for each of the N levels of memory, or to different values; for instance, the predetermined step sizes set for the first N-1 levels may be smaller than that set for the Nth level, e.g., 1 for the first N-1 levels and a positive integer greater than 1 (e.g., 3, 4, or 5) for the Nth level. However, this is merely an example, and the present invention is not limited thereto.
Wherein each of the compute engines 1110 may be configured to perform the first comparison by: comparing the stored hash value of the current input data simultaneously with all stored hash values in the nth one of the N-level memories within the predetermined step size from the addressed hash value according to an open addressing scheme. Since the fixed-step open addressing scheme has been described in detail above with reference to fig. 4, it is not repeated here.
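The fixed-step open addressing comparison can be modeled in software as follows. On the FPGA the candidate slots are compared simultaneously in one cycle, whereas this sketch simply scans the same fixed window of slots; the table layout and the `None` empty-slot sentinel are illustrative assumptions:

```python
def first_comparison(table, addressing_hash, storage_hash, step):
    """Check whether storage_hash already appears in any of the `step`
    consecutive slots starting at addressing_hash (fixed-step open
    addressing). On FPGA these comparisons happen in parallel."""
    size = len(table)
    for offset in range(step):
        slot = (addressing_hash + offset) % size  # wrap around the table end
        if table[slot] == storage_hash:
            return True  # duplicate found in this memory level
    return False

table = [None] * 8
table[3] = 0xBEEF
assert first_comparison(table, 2, 0xBEEF, step=3)      # slot 3 lies in window [2, 4]
assert not first_comparison(table, 5, 0xBEEF, step=2)  # window [5, 6] misses slot 3
```

Because the probe window has a fixed, known size, the number of comparisons per lookup is bounded, which is what keeps the pipeline free of data-dependent stalls.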
After performing the first comparison, according to a result of the first comparison, if the stored hash value of the current input data is the same as any one of all stored hash values stored in the nth stage memory of the N-th stage memories within a predetermined step from the addressed hash value, each of the calculation engines 1110 may determine that the current input data is duplicate data. When it is determined that the current input data is duplicate data, each of the calculation engines 1110 directly discards the current input data, calculates an addressing hash value and a storage hash value for the next input data, and determines whether the next input data is duplicate data according to the similar process described above.
If the stored hash value of the current input data is different from each of all stored hash values stored in the nth-level memory within the predetermined step size from the addressed hash value, each compute engine 1110 increments n by 1 (n = n + 1) to re-perform the first comparison.
Specifically, if it is determined that the stored hash value of the current input data is different from each of all stored hash values stored in the nth-level memory within the predetermined step size from the addressed hash value, each compute engine 1110 performs, for the (n+1)th-level memory, an operation similar to that already performed for the nth-level memory; that is, it re-judges whether n is less than or equal to N, i.e., whether the operations for all N levels of memory have been completed.
In other words, when the nth-level memory is one of the first N-1 levels of memory (that is, when n is less than N), if the stored hash value of the current input data is different from each of all stored hash values stored in the nth-level memory within the predetermined step size from the addressed hash value, each compute engine 1110 may be configured to replace one of those stored hash values with the stored hash value of the current input data and to re-perform the first comparison for the (n+1)th-level memory. The stored hash value in the nth-level memory that is replaced may be selected randomly, or according to a predetermined replacement rule; for example, if the predetermined step size of the nth-level memory is 1, the stored hash value located by the addressed hash value of the current input data in the nth-level memory may be directly replaced with the stored hash value of the current input data.
However, when the nth-level memory is the Nth-level memory (that is, when n equals N), if the stored hash value of the current input data is different from each of all stored hash values stored in the Nth-level memory within the predetermined step size from the addressed hash value, each compute engine 1110 may be configured to determine that the current input data is not duplicate data. In this case, each compute engine 1110 may be further configured to: judge whether an empty entry exists within the predetermined step size in the Nth-level memory; if an empty entry exists, store the stored hash value of the current input data in the empty entry; if no empty entry exists within the predetermined step size in the Nth-level memory, discard the stored hash value of the current input data. The Nth-level memory is an external memory 112 located on the FPGA card 110 outside the FPGA chip 111, and the final hash table is stored in the Nth-level memory. Since the present application aims at preliminary fast deduplication of data, a stored hash value corresponding to non-duplicate data may occasionally fail to be stored in the hash table, but this does not affect the achievement of that purpose.
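The complete multi-level flow performed by each compute engine — match check at every level, replacement in the first N-1 levels, and empty-entry insertion only in the Nth level — can be modeled in software as follows. The table sizes, step sizes, and the random choice of which window slot to replace are illustrative assumptions:

```python
import random

class MultiLevelDedup:
    """Software model of the N-level hash-table deduplication flow.
    Levels 0..N-2 model the on-chip memories; the last level models
    the external memory holding the final hash table."""

    def __init__(self, sizes, steps):
        assert len(sizes) == len(steps)
        self.tables = [[None] * s for s in sizes]
        self.steps = steps

    def is_duplicate(self, addressing_hash, storage_hash):
        last = len(self.tables) - 1
        for n, (table, step) in enumerate(zip(self.tables, self.steps)):
            base = addressing_hash % len(table)
            window = [(base + k) % len(table) for k in range(step)]
            # first comparison: a match anywhere in the window => duplicate
            if any(table[slot] == storage_hash for slot in window):
                return True
            if n < last:
                # first N-1 levels: replace one entry in the window,
                # then fall through to the (n+1)th level
                table[random.choice(window)] = storage_hash
            else:
                # Nth level: not a duplicate; insert only into an empty
                # entry, otherwise the hash value is discarded
                for slot in window:
                    if table[slot] is None:
                        table[slot] = storage_hash
                        break
        return False

dedup = MultiLevelDedup(sizes=[16, 256], steps=[1, 4])
assert dedup.is_duplicate(7, 0xAB) is False  # first occurrence: not duplicate
assert dedup.is_duplicate(7, 0xAB) is True   # second occurrence: detected
```

Note how a lookup that misses in every level also seeds the earlier levels with the new hash value, so a repeated item is usually caught in fast on-chip memory without ever reaching the external memory.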
In addition, the pipeline gap preset by the data deduplication device 100 of the present application for the deduplication operation on input data can be adjusted according to the accuracy requirement of data deduplication and the memory read-write latency, thereby controlling the frequency of data collisions and striking a balance between deduplication accuracy and pipeline performance.
In addition, although not shown in fig. 6, the data deduplication device 100 may further include a central processing unit, a main memory and a system bus, wherein the central processing unit is responsible for collecting input data, writing the input data into the main memory, and sending control signals to coordinate the operation of the FPGA card 110. After receiving the control instruction, the FPGA card 110 receives the data in the main memory through the system bus and performs the deduplication operation on the input data. After completing one or more data transmission and deduplication operations, under the control of the central processing unit, the FPGA card 110 may write the deduplicated data back to the main memory through the system bus, thereby completing the rapid preliminary deduplication of the large amount of data required by the scenario.
According to the data deduplication method and device of the present application, changing the storage structure of the key values in the hash table (i.e., calculating an addressing hash value and a storage hash value for each piece of input data and storing the storage hash value as the key value) can effectively reduce the number of comparison operations in the deduplication algorithm, and the fixed-step open addressing scheme overcomes possible stalls in the pipeline. In addition, a flexible and configurable multi-level, multi-compute-engine design can be realized according to the amount of resources on the FPGA card, the differences in type and performance between the internal and external memories of the FPGA chip, and the performance requirements of different scenarios. This design can effectively mix and allocate multiple levels of memories of different types located inside or outside the FPGA chip on the FPGA card, and the pipeline gap of algorithm operations on different memories can be adjusted, thereby improving the running performance of the data deduplication operation and largely compensating for the performance degradation caused by the access latency of the external memory.
In addition, according to the data deduplication method and device, the present application can realize rapid preliminary deduplication of data in the preprocessing stage, which helps merge identical or similar data items and thus greatly reduces the running time of back-end algorithms.
While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims (10)

1. A method of data deduplication, the method comprising: performing the following operations on input data in a pipelined manner according to a preset pipeline gap of an FPGA chip in an FPGA card:
performing hash calculation on the current input data by using the FPGA chip to obtain an addressing hash value and a storage hash value of the current input data; and
comparing, with the FPGA chip, the stored hash value of the current input data with all stored hash values stored in an nth level memory of the N levels of memories within a predetermined step size from the addressed hash value to determine whether the current input data is duplicate data,
wherein the first M levels of the N levels of memory belong to an internal memory in the FPGA chip, the last N-M levels of the N levels of memory belong to an external memory on the FPGA card outside the FPGA chip, N, M and n are positive integers, and N ≥ 2, 1 ≤ M ≤ N-1, 1 ≤ n ≤ N.
2. The method of claim 1, wherein comparing, with the FPGA chip, the stored hash value of the current input data to all stored hash values stored in an nth-level memory of the N levels of memory within a predetermined step size from the addressed hash value to determine whether the current input data is duplicate data comprises: starting from the 1st-level memory of the N levels of memory (n = 1), performing the following operations:
performing a first comparison of a stored hash value of current input data with all stored hash values stored in an nth-level memory of the N-level memories within a predetermined step size from the addressed hash value;
determining that the current input data is duplicate data if the stored hash value of the current input data is the same as any one of all stored hash values stored in an nth-order memory of the N-order memories within a predetermined step size from the addressed hash value;
if the stored hash value of the current input data is different from each of all stored hash values stored in the nth-level memory of the N levels of memory within a predetermined step size from the addressed hash value, incrementing n by 1 to re-perform the first comparison.
3. The method of claim 2, wherein performing the first comparison comprises:
comparing the stored hash value of the current input data simultaneously with all stored hash values in the nth level memory within the predetermined step size from the addressed hash value according to an open addressing scheme.
4. The method of claim 3, wherein when the nth-level memory is one of the first N-1 levels of memory among the N levels of memory, if the stored hash value of the current input data is different from each of all stored hash values stored in the nth-level memory within a predetermined step size from the addressed hash value, replacing one of all stored hash values stored in the nth-level memory within the predetermined step size from the addressed hash value with the stored hash value of the current input data, and re-performing the first comparison for the (n+1)th-level memory.
5. The method of claim 3, wherein when the nth of the N-level memories is the Nth-level memory, determining that the current input data is not duplicate data if the stored hash value of the current input data is different from each of all stored hash values stored in the nth of the N-level memories within a predetermined step size from the addressed hash value.
6. The method of claim 5, wherein when it is determined from an Nth one of the N-level memories that current input data is not duplicate data, the method further comprises:
judging whether an empty entry exists within the predetermined step size in the Nth-level memory;

if an empty entry exists within the predetermined step size in the Nth-level memory, storing the stored hash value of the current input data in the empty entry;

discarding the stored hash value of the current input data if no empty entry exists within the predetermined step size in the Nth-level memory.
7. The method of claim 1, wherein the predetermined pipeline gap can be adjusted according to data deduplication accuracy requirements and performance of memory read and write latency.
8. An apparatus for data deduplication, the apparatus comprising an FPGA card, wherein the FPGA card comprises an FPGA chip and at least one external memory, the FPGA chip comprises at least one compute engine, wherein each compute engine comprises at least one hash calculator, at least one internal memory, and a controller, and each compute engine is configured to pipeline input data according to a preset pipeline gap of the FPGA chip by:
performing hash calculation on the current input data by a hash calculator in each calculation engine to obtain an addressing hash value and a storage hash value of the current input data; and is
Comparing, by the controller in each of the compute engines, the stored hash value of the current input data with all stored hash values stored in an nth-level memory of the N levels of memory within a predetermined step size from the addressed hash value to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to the internal memory of each compute engine, the last N-M levels of the N levels of memory belong to the at least one external memory, N, M and n are positive integers, and N ≥ 2, 1 ≤ M ≤ N-1, 1 ≤ n ≤ N.
9. The apparatus of claim 8, wherein each compute engine is configured to determine whether current input data is duplicate data by performing the following operations starting from the 1st-level memory of the N levels of memory (n = 1):
performing a first comparison of a stored hash value of current input data with all stored hash values stored in an nth-level memory of the N-level memories within a predetermined step size from the addressed hash value;
determining that the current input data is duplicate data if the stored hash value of the current input data is the same as any one of all stored hash values stored in an nth-order memory of the N-order memories within a predetermined step size from the addressed hash value;
if the stored hash value of the current input data is different from each of all stored hash values stored in the nth-level memory of the N levels of memory within a predetermined step size from the addressed hash value, incrementing n by 1 to re-perform the first comparison.
10. The apparatus of claim 9, wherein each compute engine is configured to perform the first comparison by: comparing the stored hash value of the current input data simultaneously with all stored hash values in the nth one of the N-level memories within the predetermined step size from the addressed hash value according to an open addressing scheme.
CN201911338293.3A 2019-12-23 2019-12-23 Method and device for data deduplication Active CN111124312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911338293.3A CN111124312B (en) 2019-12-23 2019-12-23 Method and device for data deduplication


Publications (2)

Publication Number Publication Date
CN111124312A true CN111124312A (en) 2020-05-08
CN111124312B CN111124312B (en) 2023-10-31

Family

ID=70501185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911338293.3A Active CN111124312B (en) 2019-12-23 2019-12-23 Method and device for data deduplication

Country Status (1)

Country Link
CN (1) CN111124312B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010029571A1 (en) * 2000-04-05 2001-10-11 Atsufumi Shibayama Memory device
JP2009123050A (en) * 2007-11-16 2009-06-04 Nec Electronics Corp Information retrieving device and registration method of entry information to the same
CN102194002A (en) * 2011-05-25 2011-09-21 中兴通讯股份有限公司 Table entry adding, deleting and searching method of hash table and hash table storage device
US20110320700A1 (en) * 2010-06-24 2011-12-29 International Business Machines Corporation Concurrent Refresh In Cache Memory
CN102682116A (en) * 2012-05-14 2012-09-19 中兴通讯股份有限公司 Method and device for processing table items based on Hash table
CN103997346A (en) * 2014-05-12 2014-08-20 东南大学 Data matching method and device based on assembly line
CN107632789A (en) * 2017-09-29 2018-01-26 郑州云海信息技术有限公司 Method, system and Data duplication detection method are deleted based on distributed storage again
CN107644081A (en) * 2017-09-21 2018-01-30 锐捷网络股份有限公司 Data duplicate removal method and device
CN109086133A (en) * 2018-07-06 2018-12-25 第四范式(北京)技术有限公司 Managing internal memory data and the method and system for safeguarding data in memory
CN109189995A (en) * 2018-07-16 2019-01-11 哈尔滨理工大学 Data disappear superfluous method in cloud storage based on MPI
CN109582598A (en) * 2018-12-13 2019-04-05 武汉中元华电软件有限公司 A kind of preprocess method for realizing efficient lookup Hash table based on external storage
CN109800180A (en) * 2017-11-17 2019-05-24 爱思开海力士有限公司 Method and storage system for address of cache


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LI DONGYANG 等: "Hardware accelerator for similarity based data dedupe" *
MARIO RUIZ等: "An FPGA-based approach for packet deduplication in 100 gigabit-per-second networks" *
XIA WEN 等: "P-Dedupe: exploiting parallelism in data deduplication system" *
刘华楠;: "重复数据删除专利技术综述" *
刘竹松;杨张杰;: "基于布隆过滤器所有权证明的高效安全可去重云存储方案" *

Also Published As

Publication number Publication date
CN111124312B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
US11262982B2 (en) Computation circuit including a plurality of processing elements coupled to a common accumulator, a computation device and a system including the same
US9990412B2 (en) Data driven parallel sorting system and method
CN109426484B (en) Data sorting device, method and chip
US10831738B2 (en) Parallelized in-place radix sorting
US8868835B2 (en) Cache control apparatus, and cache control method
US20190188239A1 (en) Dual phase matrix-vector multiplication system
CN109993293B (en) Deep learning accelerator suitable for heap hourglass network
CN112015366B (en) Data sorting method, data sorting device and database system
CN111930923B (en) Bloom filter system and filtering method
CN112085644A (en) Multi-column data sorting method and device, readable storage medium and electronic equipment
US20100146242A1 (en) Data processing apparatus and method of controlling the data processing apparatus
CN116227599A (en) Inference model optimization method and device, electronic equipment and storage medium
US20200117505A1 (en) Memory processor-based multiprocessing architecture and operation method thereof
US9823896B2 (en) Parallelized in-place radix sorting
US11791822B2 (en) Programmable device for processing data set and method for processing data set
CN111124312B (en) Method and device for data deduplication
Wang et al. Improved intermediate data management for mapreduce frameworks
CN110008382B (en) Method, system and equipment for determining TopN data
CN113966532A (en) Content addressable storage device, method and related equipment
US20210082082A1 (en) Data processing method and processing circuit
CN113986980A (en) Data sorting method and device
CN112817735A (en) Computing device, computing device and method for thread group accumulation
CN114730295A (en) Mode-based cache block compression
US20140310461A1 (en) Optimized and parallel processing methods with application to query evaluation
US20230409337A1 (en) Partial sorting for coherency recovery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant