CN111124312A - Data deduplication method and device

Info

Publication number: CN111124312A
Application number: CN201911338293.3A
Authority: CN
Inventors: 李嘉树; 季成; 卢冕
Original assignee: 4Paradigm Beijing Technology Co Ltd
Current assignee: 4Paradigm Beijing Technology Co Ltd
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2020-05-08
Anticipated expiration: 2039-12-23
Also published as: CN111124312B

Abstract

A data deduplication method and apparatus are provided. The method includes performing the following operations on input data in a pipelined manner according to a preset pipeline gap of an FPGA chip on an FPGA board: performing, by the FPGA chip, hash calculation on current input data to obtain an addressing hash value and a storage hash value of the current input data; and comparing, by the FPGA chip, the storage hash value of the current input data with all storage hash values stored within a predetermined step size from the addressing hash value in an nth-level memory among N levels of memory, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to internal memory within the FPGA chip, the last N-M levels belong to external memory on the FPGA board outside the FPGA chip, N, M and n are positive integers, N ≥ 2, 1 ≤ M ≤ N-1, and 1 ≤ n ≤ N.

Description

Data deduplication method and device

Technical Field

The present application relates generally to the field of data deduplication, and more particularly, to a method of data deduplication and an apparatus therefor.

Background

With the rapid development of big data technology, information systems throughout society continuously generate ever more data. As data sets grow, many classical algorithms can no longer be applied efficiently because of their computation time. At the same time, modern data sets contain more and more duplicate or partially duplicate data. If data can be deduplicated quickly and effectively in the preprocessing stage, identical or similar data items can be merged, which greatly reduces the running time of back-end algorithms.

At present, using a hash table for data deduplication is common engineering practice. Because hash-based deduplication has an average time complexity of O(n), it tends to achieve higher throughput on general data than deduplication based on sorting. However, owing to limitations in hash computation, parallelization and cache control, the performance of hash-based deduplication on a central processing unit (CPU) is often unsatisfactory, and deduplication itself therefore becomes the bottleneck of the computing task in many cases.

In addition, thanks to pipeline optimization and parallel computation, implementing the hash deduplication algorithm on an FPGA is a fast and efficient engineering choice in the prior art. Pipeline optimization is a parallel optimization method commonly used in hardware acceleration: it divides a complex processing operation into multiple steps and overlaps the steps of different operations so that multiple operations proceed in parallel, producing up to one valid output per clock cycle. Furthermore, the large amount of programmable computing resources on the FPGA can be used to instantiate multiple groups of processing engines that compute in parallel, further improving the performance of the hash deduplication algorithm on the FPGA.

However, because of the linked-list storage structure used for data in the hash algorithm, the limited capacity of the internal memory of the FPGA chip, and the constraints on the access mode of memory external to the FPGA chip (DDR, etc.), the pipeline of a hash deduplication algorithm on an FPGA chip often stalls heavily on wait operations. Since a pipelined system produces no valid output during a stall, such stalls can degrade system performance by a factor of several or even tens.

Disclosure of Invention

An exemplary embodiment of the present invention is to provide a data deduplication method and an apparatus thereof to solve at least the above problems of the prior art.

According to an exemplary embodiment of the present invention, a method for data deduplication may include performing the following operations on input data in a pipelined manner according to a preset pipeline gap of an FPGA chip on an FPGA board: performing, by the FPGA chip, hash calculation on the current input data to obtain an addressing hash value and a storage hash value of the current input data; and comparing, by the FPGA chip, the storage hash value of the current input data with all storage hash values stored within a predetermined step size from the addressing hash value in an nth-level memory among N levels of memory, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to internal memory within the FPGA chip, the last N-M levels belong to external memory on the FPGA board outside the FPGA chip, N, M and n are positive integers, N ≥ 2, 1 ≤ M ≤ N-1, and 1 ≤ n ≤ N.

Optionally, the step of comparing, by the FPGA chip, the storage hash value of the current input data with all storage hash values stored within the predetermined step size from the addressing hash value in the nth-level memory to determine whether the current input data is duplicate data may include performing the following operations starting from the level-1 memory (n = 1) of the N levels of memory: performing a first comparison, in which the storage hash value of the current input data is compared with all storage hash values stored within the predetermined step size from the addressing hash value in the nth-level memory; determining that the current input data is duplicate data if its storage hash value is the same as any one of those storage hash values; and, if its storage hash value differs from every one of those storage hash values, setting n = n + 1 and performing the first comparison again.

Optionally, the step of performing the first comparison may include: simultaneously comparing, according to an open addressing scheme, the storage hash value of the current input data with all storage hash values within the predetermined step size from the addressing hash value in the nth-level memory.

Optionally, when the nth-level memory is one of the first N-1 levels of the N levels of memory, if the storage hash value of the current input data differs from every storage hash value stored within the predetermined step size from the addressing hash value in the nth-level memory, one of those storage hash values is replaced with the storage hash value of the current input data, and the first comparison is performed again for the (n+1)th-level memory.

Optionally, when the nth-level memory is the Nth-level memory, the current input data is determined not to be duplicate data if its storage hash value differs from every storage hash value stored within the predetermined step size from the addressing hash value in the Nth-level memory.

Optionally, when the current input data is determined not to be duplicate data according to the Nth-level memory, the method may further include: determining whether an empty entry exists within the predetermined step size in the Nth-level memory; if an empty entry exists within the predetermined step size in the Nth-level memory, storing the storage hash value of the current input data in the empty entry; and, if no empty entry exists within the predetermined step size in the Nth-level memory, discarding the storage hash value of the current input data.

Optionally, the preset pipeline gap may be adjusted according to the accuracy requirement of data deduplication and the read/write latency of the memory.

According to an exemplary embodiment of the present invention, there is provided an apparatus for data deduplication, which may include an FPGA board, wherein the FPGA board includes an FPGA chip and at least one external memory, the FPGA chip includes at least one computation engine, each computation engine includes at least one hash calculator, at least one internal memory, and a controller, and each computation engine is configured to process input data in a pipelined manner according to a preset pipeline gap of the FPGA chip by: performing, by the hash calculator in the computation engine, hash calculation on the current input data to obtain an addressing hash value and a storage hash value of the current input data; and comparing, by the controller in the computation engine, the storage hash value of the current input data with all storage hash values stored within a predetermined step size from the addressing hash value in an nth-level memory among N levels of memory, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to the internal memory of the computation engine, the last N-M levels belong to the at least one external memory, N, M and n are positive integers, N ≥ 2, 1 ≤ M ≤ N-1, and 1 ≤ n ≤ N.

Optionally, each computation engine may be configured to determine whether the current input data is duplicate data by performing the following operations starting from the level-1 memory (n = 1) of the N levels of memory: performing a first comparison, in which the storage hash value of the current input data is compared with all storage hash values stored within the predetermined step size from the addressing hash value in the nth-level memory; determining that the current input data is duplicate data if its storage hash value is the same as any one of those storage hash values; and, if its storage hash value differs from every one of those storage hash values, setting n = n + 1 and performing the first comparison again.

Optionally, each computation engine may be configured to perform the first comparison by: simultaneously comparing, according to an open addressing scheme, the storage hash value of the current input data with all storage hash values within the predetermined step size from the addressing hash value in the nth-level memory.

Optionally, when the nth-level memory is one of the first N-1 levels of the N levels of memory, if the storage hash value of the current input data differs from every storage hash value stored within the predetermined step size from the addressing hash value in the nth-level memory, each computation engine is configured to replace one of those storage hash values with the storage hash value of the current input data and to perform the first comparison again for the (n+1)th-level memory.

Optionally, when the nth-level memory is the Nth-level memory, if the storage hash value of the current input data differs from every storage hash value stored within the predetermined step size from the addressing hash value in the Nth-level memory, each computation engine is configured to determine that the current input data is not duplicate data.

Optionally, when the current input data is determined not to be duplicate data according to the Nth-level memory, each computation engine may be further configured to: determine whether an empty entry exists within the predetermined step size in the Nth-level memory; if an empty entry exists within the predetermined step size in the Nth-level memory, store the storage hash value of the current input data in the empty entry; and, if no empty entry exists within the predetermined step size in the Nth-level memory, discard the storage hash value of the current input data.

Optionally, the preset pipeline gap can be adjusted according to the accuracy requirement of data deduplication and the read/write latency of the memory.

The method for data deduplication and the device thereof according to the exemplary embodiments of the present application may implement a fast preliminary deduplication operation of a large amount of data.

Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.

Drawings

These and/or other aspects and advantages of the present application will become more apparent and more readily appreciated from the following detailed description of the embodiments of the present application, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a general flowchart illustrating a data deduplication method according to an exemplary embodiment of the present application;

FIG. 2 is an example of an FPGA board according to an exemplary embodiment of the present application;

FIG. 3 is a detailed flowchart illustrating a data deduplication method according to an exemplary embodiment of the present application;

FIG. 4 is a diagram illustrating an open addressing scheme with fixed step sizes according to an exemplary embodiment of the present application;

FIG. 5 is a schematic diagram illustrating pipeline gap adjustment;

FIG. 6 is a block diagram illustrating a structure of a data deduplication apparatus according to an exemplary embodiment of the present application;

FIG. 7 is a block diagram illustrating a structure of a computation engine according to an exemplary embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments are described below in order to explain the present invention by referring to the figures.

FIG. 1 is a flowchart illustrating a data deduplication method according to an exemplary embodiment of the present application. To improve deduplication efficiency, the method performs steps S100 and S200 below on the input data in a pipelined manner according to a preset pipeline gap of the FPGA chip on the FPGA board, thereby implementing data deduplication.

In step S100, the FPGA chip performs hash calculation on the current input data to obtain an addressing hash value and a storage hash value of the current input data.

Specifically, a conventional hash deduplication algorithm stores the complete data in the hash table as the key, and keys in the hash table are often stored as linked lists. Because the length of a linked list is not known in advance, this storage scheme introduces timing uncertainty into the addressing operation, which tends to increase the complexity of the pipeline or force it to stall. In addition, in many common data sets the data items are long or of variable length, and since the hash deduplication algorithm performs a large number of comparison operations, the key length also affects the throughput of the deduplication system. For images, for example, a conventional hash deduplication algorithm stores the complete image data as the key, so the entries in the hash table are long and vary with the amount of data in each image.

To solve these problems, the data deduplication method according to the present application adopts a new storage structure for hash table keys. Instead of storing the complete data (e.g., image data, audio data, etc.) directly as the key, it calculates two hash values of the current input data: an addressing hash value and a storage hash value. The addressing hash value is used to address the memory, that is, to locate a key stored in the memory. The storage hash value is the key that is stored in the memory and is used in the comparison operations for the current input data. This effectively reduces the number of comparison operations and saves a large amount of computing and storage resources.

In addition, the data deduplication method uses two different hash algorithms to calculate the addressing hash value and the storage hash value of the current input data.
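The two-hash key structure described above can be sketched in software as follows. This is a minimal illustration only: the source does not name the two hash algorithms, so BLAKE2b and truncated SHA-256 are assumptions, as are the table size `TABLE_SLOTS`, the 64-bit key width, and the function names.

```python
import hashlib

TABLE_SLOTS = 1 << 16  # hypothetical number of slots in one memory level


def addressing_hash(data: bytes) -> int:
    """Slot index used to locate a position in a memory level (assumed: BLAKE2b)."""
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % TABLE_SLOTS


def storage_hash(data: bytes) -> int:
    """Fixed-width key stored instead of the full data item (assumed: truncated SHA-256)."""
    digest = hashlib.sha256(data).digest()[:8]
    return int.from_bytes(digest, "big")


# A long, variable-length item is reduced to a slot index and a 64-bit key.
item = b"example image bytes" * 1000
addr, key = addressing_hash(item), storage_hash(item)
print(addr < TABLE_SLOTS, key.bit_length() <= 64)  # True True
```

Whatever the length of the input, every comparison then operates on a fixed-width key, which is the property the description relies on.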

In step S200, the storage hash value of the current input data is compared with all storage hash values stored within a predetermined step size from the addressing hash value in the nth-level memory among the N levels of memory, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to internal memory (e.g., a cache) within the FPGA chip, the last N-M levels belong to external memory (e.g., DDR, etc.) on the FPGA board outside the FPGA chip, N, M and n are positive integers, N ≥ 2, 1 ≤ M ≤ N-1, and 1 ≤ n ≤ N.

For example, as shown in FIG. 2, the FPGA board includes an FPGA chip and 3 external memories, and the FPGA chip includes 2 computation engines, each of which may include 4 hash calculators, 3 internal memories, and 1 controller. All memories on the FPGA board are divided into 4 levels: the internal memories of each computation engine form the level-1 and level-2 memories, and the external memories located on the FPGA board outside the FPGA chip form the level-3 and level-4 memories. In other words, N = 4 and M = 2 in the example of FIG. 2, but this is only an example, and the present invention is not limited thereto. The internal memory of the FPGA chip has a small storage capacity but very high read and write speeds, so it can accelerate the lookup of duplicate data and improve deduplication accuracy. Therefore, by combining multiple levels of memory of different types inside and outside the FPGA chip, the present application overcomes both the problem that the internal memory of the FPGA chip, although fast, is too small to handle a large amount of data, and the problem that the external memory (such as DDR), although large, is slow enough to open large gaps in the pipeline of the hash algorithm.
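The level layout and its constraints (N ≥ 2, 1 ≤ M ≤ N-1) can be expressed as a small configuration check. The sketch below is illustrative only; the names `MemoryLevel` and `build_levels` are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MemoryLevel:
    index: int      # n, the 1-based level number
    internal: bool  # True if the level is inside the FPGA chip


def build_levels(total_levels: int, internal_levels: int) -> List[MemoryLevel]:
    """Return levels 1..N, the first M internal and the remaining N-M external."""
    if total_levels < 2:
        raise ValueError("N must be at least 2")
    if not 1 <= internal_levels <= total_levels - 1:
        raise ValueError("M must satisfy 1 <= M <= N-1")
    return [MemoryLevel(n, n <= internal_levels) for n in range(1, total_levels + 1)]


# The FIG. 2 example: N = 4 levels, of which M = 2 are internal.
levels = build_levels(total_levels=4, internal_levels=2)
print(["internal" if lv.internal else "external" for lv in levels])
# ['internal', 'internal', 'external', 'external']
```

The constraint M ≤ N-1 guarantees that at least the last level is external, which matches the later statement that the final hash table resides in external memory.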

The data deduplication method illustrated in FIG. 1 is described in detail below with reference to FIG. 3, which is a detailed flowchart illustrating the data deduplication method according to an exemplary embodiment of the present application.

As shown in FIG. 3, in step S100, the FPGA chip performs a hash calculation on the current input data to obtain an addressing hash value and a storage hash value of the current input data. Since this has been described in detail above, it is not repeated here.

In step S201, n is set to 1, that is, the subsequent operations start from the level-1 memory (n = 1) of the N levels of memory on the FPGA board.

In step S202, it is determined whether n is less than or equal to N, that is, whether the nth-level memory is one of the N levels of memory.

In step S203, a first comparison is performed, in which the storage hash value of the current input data is compared with all storage hash values stored within the predetermined step size from the addressing hash value in the nth-level memory; in other words, the first comparison determines whether the current input data is duplicate data.

Specifically, the predetermined step size may be set to the same value for every one of the N levels of memory, or to different values for different levels, for instance different values for the internal memory of the FPGA chip and for the external memory located on the FPGA board outside the FPGA chip. For example, the predetermined step sizes set for the first N-1 levels may be smaller than the predetermined step size set for the Nth-level memory: the predetermined step size may be set to 1 for each of the first N-1 levels and to a positive integer greater than 1 (e.g., 3, 4, 5, etc.) for the Nth-level memory. This is merely an example, and the present invention is not limited thereto.

The step of performing the first comparison may include: simultaneously comparing, according to an open addressing scheme, the storage hash value of the current input data with all storage hash values within the predetermined step size from the addressing hash value in the nth-level memory.

Specifically, the open addressing scheme adopted by the method is a fixed-step-size open addressing scheme. This scheme exploits the parallel computing capability of the FPGA chip by comparing all elements within the step size range simultaneously. Fixed-step-size open addressing keeps the addressing time constant, which greatly simplifies the pipeline design of the algorithm, effectively avoids pipeline stalls within a controllable accuracy range, and effectively resolves hash collisions. For example, FIG. 4 illustrates a process of simultaneously comparing, according to the open addressing scheme, the storage hash value of the current input data with all storage hash values within the predetermined step size from the addressing hash value in the nth-level memory. In the example shown in FIG. 4, the predetermined step size set for the nth-level memory is 4. When the first comparison is performed, the addressing hash value is first used to locate a key in the nth-level memory; for example, addressing hash value 1 in FIG. 4 locates the position of the key "storage hash value 1" in the nth-level memory. Then, according to the predetermined step size of 4, the 4 entries within the predetermined step size from that position are acquired: "storage hash value 1", "empty", "storage hash value 2", and "storage hash value 3". The storage hash value of the current input data is then compared with the 4 acquired entries simultaneously. The process shown in FIG. 4 is merely exemplary, and the present invention is not limited thereto.
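The fixed-step-size comparison of FIG. 4 can be modelled in a few lines. This is a software sketch under stated assumptions: the function names are hypothetical, an empty slot is represented by `None`, and wrapping around at the end of the level is an assumption the source does not specify. On the FPGA the comparisons run in parallel; `any` only models the combined result.

```python
from typing import List, Optional

EMPTY = None  # hypothetical marker for an unused slot


def window(level: List[Optional[int]], addr: int, step: int) -> List[Optional[int]]:
    """Return the `step` slots starting at `addr` (wrap-around is an assumption)."""
    return [level[(addr + i) % len(level)] for i in range(step)]


def first_comparison(level: List[Optional[int]], addr: int, key: int, step: int) -> bool:
    """True if `key` equals any storage hash value inside the window."""
    return any(slot is not EMPTY and slot == key for slot in window(level, addr, step))


# Mirrors FIG. 4: from the addressed slot, the window holds
# storage hash 1, empty, storage hash 2, storage hash 3.
level = [EMPTY, 0x11, EMPTY, 0x22, 0x33, EMPTY, EMPTY, EMPTY]
print(first_comparison(level, addr=1, key=0x22, step=4))  # True
print(first_comparison(level, addr=1, key=0x44, step=4))  # False
```

Because the window length is fixed, the number of comparisons per lookup is constant, which is what keeps the addressing time stable.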

Referring back to FIG. 3, if the storage hash value of the current input data is the same as any one of the storage hash values stored within the predetermined step size from the addressing hash value in the nth-level memory, the current input data is determined to be duplicate data in step S204. Specifically, if the current input data is determined to be duplicate data, the data deduplication method directly discards the current input data and then performs step S100 for the next input data.

If the storage hash value of the current input data differs from every storage hash value stored within the predetermined step size from the addressing hash value in the nth-level memory, n is set to n + 1 in step S205 so that the first comparison is performed again.

Specifically, if the storage hash value of the current input data is determined to differ from every storage hash value stored within the predetermined step size from the addressing hash value in the nth-level memory, the data deduplication method performs, for the (n+1)th-level memory, an operation similar to that already performed for the nth-level memory. As shown in FIG. 3, after step S205 the method returns to step S202 to determine whether n is less than or equal to N, that is, whether the operations for all N levels of memory have been completed.

In other words, when the nth-level memory is one of the first N-1 levels of the N levels of memory (that is, when n is less than N), if the storage hash value of the current input data differs from every storage hash value stored within the predetermined step size from the addressing hash value in the nth-level memory, one of those storage hash values is replaced with the storage hash value of the current input data, and the first comparison is performed again for the (n+1)th-level memory. The storage hash value to be replaced in the nth-level memory may be selected randomly or according to a predetermined replacement rule. For example, if the predetermined step size of the nth-level memory is 1, the storage hash value located by the addressing hash value of the current input data in the nth-level memory may be directly replaced with the storage hash value of the current input data.

However, when the nth-level memory is the Nth-level memory (that is, when n is equal to N), if the storage hash value of the current input data differs from every storage hash value stored within the predetermined step size from the addressing hash value in the Nth-level memory, the current input data is determined not to be duplicate data and may be output. In this case, the data deduplication method further includes: determining whether an empty entry exists within the predetermined step size from the addressing hash value in the Nth-level memory; if such an empty entry exists, storing the storage hash value of the current input data in the empty entry; and, if no such empty entry exists, discarding the storage hash value of the current input data. The Nth-level memory is an external memory located on the FPGA board outside the FPGA chip, and the final hash table is stored in the Nth-level memory.
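Steps S100 and S201 to S205, together with the replacement and insertion rules above, can be put together in a small software model. This is a sketch and not the FPGA implementation: it processes one item at a time with no pipelining, the hash functions (BLAKE2b and truncated SHA-256) are assumptions, random replacement is one of the options the description allows, and the class and method names are hypothetical.

```python
import hashlib
import random
from typing import List, Optional, Tuple


class MultiLevelDeduper:
    """Sequential model of the N-level lookup (S100, S201 to S205)."""

    def __init__(self, sizes: List[int], steps: List[int], seed: int = 0):
        if len(sizes) != len(steps) or len(sizes) < 2:
            raise ValueError("need one step size per level and at least 2 levels")
        self.levels: List[List[Optional[int]]] = [[None] * s for s in sizes]
        self.steps = steps
        self.rng = random.Random(seed)  # random replacement policy

    @staticmethod
    def _hashes(data: bytes) -> Tuple[int, int]:
        # Assumed hash functions; the source only requires two different algorithms.
        addr = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
        key = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        return addr, key

    def is_duplicate(self, data: bytes) -> bool:
        addr, key = self._hashes(data)                       # S100
        last = len(self.levels) - 1
        for n, (level, step) in enumerate(zip(self.levels, self.steps)):  # S201/S202/S205
            slots = [(addr + i) % len(level) for i in range(step)]
            if any(level[s] == key for s in slots):          # S203: first comparison
                return True                                  # S204: duplicate
            if n < last:
                # Miss in a non-final level: overwrite one slot in the window.
                level[self.rng.choice(slots)] = key
            else:
                # Miss in the final level: store only if an empty entry exists.
                empty = next((s for s in slots if level[s] is None), None)
                if empty is not None:
                    level[empty] = key
        return False                                         # not a duplicate: output


dedup = MultiLevelDeduper(sizes=[8, 64], steps=[1, 4])
stream = [b"a", b"b", b"a", b"c", b"b"]
print([x for x in stream if not dedup.is_duplicate(x)])  # [b'a', b'b', b'c']
```

Note that when the final-level window is full, the key is dropped, so a later copy of the same item would be reported as new. The model therefore reproduces the "preliminary" nature of the deduplication: some duplicates can pass through, and because only hash values are compared, a storage-hash collision could also cause a distinct item to be treated as a duplicate.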

In addition, as shown in FIG. 5, if different operations read and write the same memory address, the read-write pattern in FIG. 5 may introduce data hazards that affect deduplication accuracy. The pipeline gap preset for the deduplication operation can therefore be adjusted according to the accuracy requirement of data deduplication and the read/write latency of the memory, thereby controlling how often data hazards occur and balancing deduplication accuracy against pipeline performance.
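The trade-off between pipeline gap and accuracy can be illustrated with a toy timing model. Everything in it is an assumption made for illustration and is not taken from FIG. 5: item i is taken to read memory at cycle i × gap, its write is taken to become visible `latency` cycles later, and a duplicate is counted as missed when it reads before the first copy's write is visible.

```python
from typing import Dict, Hashable, List


def missed_duplicates(stream: List[Hashable], gap: int, latency: int) -> int:
    """Count duplicates that slip through because an earlier write is not yet visible.

    Toy model (assumption): item i reads at cycle i * gap, and the write of a
    first occurrence becomes visible at cycle i * gap + latency.
    """
    first_write_visible: Dict[Hashable, int] = {}
    missed = 0
    for i, key in enumerate(stream):
        now = i * gap
        if key in first_write_visible:
            if now < first_write_visible[key]:
                missed += 1  # read-after-write hazard: duplicate not detected
        else:
            first_write_visible[key] = now + latency
    return missed


stream = ["x", "x", "y", "x", "y", "y"]
for gap in (1, 2, 4):
    print(gap, missed_duplicates(stream, gap=gap, latency=4))
# 1 4
# 2 1
# 4 0
```

In this model a gap equal to the write latency removes the hazard entirely but accepts only one item every four cycles, while a gap of 1 gives full throughput and misses every nearby duplicate. Intermediate gaps trade one against the other.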

FIG. 6 is a block diagram illustrating a structure of a data deduplication apparatus 100 according to an exemplary embodiment of the present application.

As shown in FIG. 6, the data deduplication apparatus 100 may include an FPGA board 110, wherein the FPGA board 110 may include an FPGA chip 111 and at least one external memory 112. Further, the FPGA chip 111 includes at least one computation engine 1110, wherein, as shown in FIG. 7, each computation engine 1110 includes a controller 1111, at least one hash calculator 1112, and at least one internal memory 1113.

Each computation engine 1110 may perform deduplication on the current input data by means of a finite state machine, with pipeline optimization applied to the entire deduplication operation. That is, each computation engine 1110 is configured to deduplicate input data in a pipelined manner according to a preset pipeline gap of the FPGA chip 111, and can process one new piece of input data per clock cycle on average.

Specifically, each computation engine 1110 is configured to process input data in a pipelined manner according to the preset pipeline gap of the FPGA chip 111 by: performing, by the hash calculator 1112 in the computation engine 1110, a hash calculation on the current input data to obtain an addressing hash value and a storage hash value of the current input data; and comparing, by the controller 1111 of the computation engine 1110, the storage hash value of the current input data with all storage hash values stored within a predetermined step size from the addressing hash value in the nth-level memory among the N levels of memory, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to the internal memory 1113 of the computation engine 1110, the last N-M levels belong to the at least one external memory 112, N, M and n are positive integers, N ≥ 2, 1 ≤ M ≤ N-1, and 1 ≤ n ≤ N.

How the computation engine 1110 performs data deduplication is described in detail below.

Specifically, each computation engine 1110 may be configured to determine whether the current input data is duplicate data by performing the following operations starting from the level-1 memory (n = 1) of the N levels of memory. First, n is set to 1, that is, the subsequent operations start from the level-1 memory of the N levels of memory on the FPGA board 110. Then, it is determined whether n is less than or equal to N, that is, whether the nth-level memory is one of the N levels of memory. Thereafter, a first comparison is performed, in which the storage hash value of the current input data is compared with all storage hash values stored within the predetermined step size from the addressing hash value in the nth-level memory; in other words, the first comparison determines whether the current input data is duplicate data.

The predetermined step size may be set to the same value for every one of the N levels of memory, or to different values for different levels, for instance different values for the internal memory 1113 of the FPGA chip 111 and for the external memory 112 located on the FPGA board 110 outside the FPGA chip 111. For example, the predetermined step sizes set for the first N-1 levels may be smaller than the predetermined step size set for the Nth-level memory: the predetermined step size may be set to 1 for each of the first N-1 levels and to a positive integer greater than 1 (e.g., 3, 4, 5, etc.) for the Nth-level memory. This is merely an example, and the present invention is not limited thereto.

Each calculation engine 1110 may be configured to perform the first comparison by comparing, according to an open addressing scheme, the storage hash value of the current input data simultaneously with all storage hash values in the n-th level memory within the predetermined step size from the addressing hash value. Since the fixed-step open addressing scheme has been described in detail above with reference to FIG. 4, it is not repeated here.
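The first comparison can be sketched in software as a check over a fixed window of slots. Two assumptions are made here that the description does not state: the window covers the slot at the addressing hash value and the following step-1 slots, and it wraps around at the end of the table. An empty slot is represented by `None`. Hardware can compare the whole window in parallel; this sketch simply iterates.

```python
from typing import List, Optional, Sequence


def window_slots(addr: int, step: int, size: int) -> List[int]:
    """Slots examined for one lookup (assumed contiguous, with wrap-around)."""
    return [(addr + offset) % size for offset in range(step)]


def first_comparison(level: Sequence[Optional[int]], addr: int,
                     storage_hash: int, step: int) -> bool:
    """Return True if storage_hash is already present within the window."""
    return any(level[slot] == storage_hash
               for slot in window_slots(addr, step, len(level)))
```

Because the window length is fixed, every lookup touches the same number of slots, so the number of comparisons per input does not vary from one input to the next.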

After the first comparison is performed, if its result shows that the storage hash value of the current input data is the same as any one of the storage hash values stored in the n-th level memory within the predetermined step size from the addressing hash value, each calculation engine 1110 may determine that the current input data is duplicate data. When the current input data is determined to be duplicate data, each calculation engine 1110 directly discards the current input data, calculates an addressing hash value and a storage hash value for the next input data, and determines whether the next input data is duplicate data by the same process as described above.

If the storage hash value of the current input data is different from every storage hash value stored in the n-th level memory within the predetermined step size from the addressing hash value, each calculation engine 1110 sets n = n + 1 and performs the first comparison again.

Specifically, if it is determined that the storage hash value of the current input data is different from every storage hash value stored in the n-th level memory within the predetermined step size from the addressing hash value, each calculation engine 1110 performs, for the (n+1)-th level memory, operations similar to those already performed for the n-th level memory. That is, it again determines whether n is less than or equal to N, namely whether the operations for all N levels of memory have been completed.

In other words, when the n-th level memory is one of the first N-1 levels of the N levels of memory (that is, when n is less than N), and the storage hash value of the current input data is different from every storage hash value stored in the n-th level memory within the predetermined step size from the addressing hash value, each calculation engine 1110 may be configured to replace one of those storage hash values with the storage hash value of the current input data and to perform the first comparison again for the (n+1)-th level memory. The storage hash value to be replaced in the n-th level memory may be selected randomly or according to a predetermined replacement rule. For example, if the predetermined step size of the n-th level memory is 1, the storage hash value located in the n-th level memory by the addressing hash value of the current input data may be directly replaced with the storage hash value of the current input data.

However, when the n-th level memory is the N-th level memory (that is, when n is equal to N), and the storage hash value of the current input data is different from every storage hash value stored in the N-th level memory within the predetermined step size from the addressing hash value, each calculation engine 1110 may be configured to determine that the current input data is not duplicate data. In this case, each calculation engine 1110 may be further configured to: determine whether an empty entry exists within the predetermined step size in the N-th level memory; if an empty entry exists within the predetermined step size in the N-th level memory, store the storage hash value of the current input data in the empty entry; and if no empty entry exists within the predetermined step size in the N-th level memory, discard the storage hash value of the current input data. The N-th level memory is an external memory 112 located outside the FPGA chip 111 on the FPGA board 110, and the final hash table is stored in the N-th level memory. Because the present application aims at fast preliminary deduplication of data, the storage hash value of some non-duplicate data may fail to be stored in the hash table, but this does not prevent the purpose of fast preliminary deduplication from being achieved.
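The level-by-level procedure described above can be summarized in a software sketch. It is not the FPGA implementation: the function name `is_duplicate`, the use of `None` for an empty entry, the reduction of the addressing hash value modulo each level's size, the wrap-around window, and the replacement rule (always overwrite the first slot of the window) are all assumptions chosen for illustration.

```python
from typing import List, Optional

Level = List[Optional[int]]  # None marks an empty entry


def is_duplicate(levels: List[Level], steps: List[int],
                 addressing_hash: int, storage_hash: int) -> bool:
    """Walk levels 1..N; return True as soon as one level holds storage_hash."""
    last = len(levels) - 1
    for index, (level, step) in enumerate(zip(levels, steps)):
        size = len(level)
        window = [(addressing_hash + offset) % size for offset in range(step)]
        if any(level[slot] == storage_hash for slot in window):
            return True  # duplicate: the caller discards the input
        if index < last:
            # Miss in one of the first N-1 levels: replace one entry
            # (assumed rule: the first slot of the window), then go on.
            level[window[0]] = storage_hash
        else:
            # Miss in the N-th level: store into an empty entry if one
            # exists in the window, otherwise drop the storage hash.
            for slot in window:
                if level[slot] is None:
                    level[slot] = storage_hash
                    break
    return False  # not found in any level: treated as non-duplicate


# Example with N = 2: a 4-slot level (step 1) and a 16-slot level (step 2).
tables: List[Level] = [[None] * 4, [None] * 16]
step_sizes = [1, 2]
first_time = is_duplicate(tables, step_sizes, 5, 111)
second_time = is_duplicate(tables, step_sizes, 5, 111)
```

In this sketch, once the window in the last level is full, a new storage hash value is dropped and a later repeat of that input may be reported as non-duplicate, which matches the preliminary nature of the deduplication described above.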

In addition, the preset pipeline gap used by the data deduplication apparatus 100 when performing the deduplication operation on input data can be adjusted according to the required deduplication accuracy and the read/write latency of the memories, so as to control how often data conflicts occur and thereby strike a balance between deduplication accuracy and pipeline performance.
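The trade-off can be illustrated with a toy timing model. The model is an assumption made for this sketch and is not taken from the description: a single-level table with step size 1, one input entering every `gap` cycles, and each write becoming visible `write_latency` cycles after it is issued. A true duplicate that is reported as new, because the earlier write is not yet visible or has been overwritten, is counted as missed.

```python
from typing import List, Optional, Sequence, Set, Tuple


def count_missed_duplicates(stream: Sequence[Tuple[int, int]], gap: int,
                            write_latency: int, size: int) -> int:
    """Count true duplicates reported as new under the toy timing model.

    stream holds one (addressing hash, storage hash) pair per input item.
    """
    table: List[Optional[int]] = [None] * size
    pending: List[Tuple[int, int, int]] = []  # (commit cycle, slot, value)
    seen: Set[Tuple[int, int]] = set()
    missed = 0
    for k, (addr, stored) in enumerate(stream):
        cycle = k * gap
        # Make visible every write whose latency has elapsed by this cycle.
        for commit, slot, value in pending:
            if commit <= cycle:
                table[slot] = value
        pending = [p for p in pending if p[0] > cycle]
        slot = addr % size
        if table[slot] != stored:
            if (addr, stored) in seen:
                missed += 1  # true duplicate not detected
            pending.append((cycle + write_latency, slot, stored))
        seen.add((addr, stored))
    return missed


# Ten identical items: a short gap reads stale state, a longer gap does not.
same_items = [(5, 111)] * 10
missed_short_gap = count_missed_duplicates(same_items, gap=1, write_latency=3, size=8)
missed_long_gap = count_missed_duplicates(same_items, gap=3, write_latency=3, size=8)
```

Under this model, a gap at least as long as the write latency removes stale reads at the cost of accepting inputs less often, while a shorter gap accepts inputs faster but lets some duplicates through.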

In addition, although not shown in FIG. 6, the data deduplication apparatus 100 may further include a central processing unit, a main memory, and a system bus. The central processing unit is responsible for collecting input data, writing the input data into the main memory, and sending control signals to coordinate the operation of the FPGA board 110. After the control signal is received, the data in the main memory is transmitted to the FPGA board 110 through the system bus, and the FPGA board 110 performs the deduplication operation on the input data. After one or more rounds of data transmission and deduplication are completed, the FPGA board 110 may, under the control of the central processing unit, write the deduplicated data back to the main memory through the system bus, thereby completing the fast preliminary deduplication of the large amount of data required by the scenario.

According to the data deduplication method and the data deduplication apparatus of the present application, the number of comparison operations in the deduplication algorithm can be effectively reduced by changing the storage structure of the key values in the hash table (i.e., calculating an addressing hash value and a storage hash value for each input data and storing the storage hash value as the key value), and possible stalls in the pipeline are avoided by using the open addressing scheme with a fixed step size. In addition, a flexible and configurable multi-level, multi-engine design can be realized according to the amount of resources on the FPGA board, the differences in type and performance between the internal and external memories of the FPGA chip, and the performance requirements of different scenarios. This design can effectively combine and allocate multiple levels of memory of different types located inside or outside the FPGA chip on the FPGA board, and the pipeline gap with which the algorithm operates on the different memories can be adjusted, so that the running performance of the deduplication operation is improved and the performance loss caused by the access latency of the external memory is largely compensated.

In addition, according to the data deduplication method and the data deduplication device, the application can realize rapid preliminary deduplication of data in the preprocessing stage of the data, which is helpful for merging the same or similar data items, thereby greatly reducing the running time of a back-end algorithm.

While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims

1. A method of data deduplication, the method comprising: performing the following operations on input data in a pipelined manner according to a preset pipeline gap of an FPGA chip on an FPGA board:

performing hash calculation on the current input data by using the FPGA chip to obtain an addressing hash value and a storage hash value of the current input data; and

comparing, by using the FPGA chip, the storage hash value of the current input data with all storage hash values stored in an n-th level memory of N levels of memory within a predetermined step size from the addressing hash value, to determine whether the current input data is duplicate data,

wherein the first M levels of the N levels of memory belong to an internal memory in the FPGA chip, the last N-M levels of the N levels of memory belong to an external memory located outside the FPGA chip on the FPGA board, N, M, and n are positive integers, N ≥ 2, 1 ≤ M ≤ N-1, and 1 ≤ n ≤ N.

2. The method of claim 1, wherein comparing, by using the FPGA chip, the storage hash value of the current input data with all storage hash values stored in the n-th level memory of the N levels of memory within the predetermined step size from the addressing hash value to determine whether the current input data is duplicate data comprises performing the following operations starting from the first level memory (n = 1) of the N levels of memory:

performing a first comparison of the storage hash value of the current input data with all storage hash values stored in the n-th level memory of the N levels of memory within the predetermined step size from the addressing hash value;

determining that the current input data is duplicate data if the storage hash value of the current input data is the same as any one of the storage hash values stored in the n-th level memory within the predetermined step size from the addressing hash value; and

if the storage hash value of the current input data is different from every storage hash value stored in the n-th level memory within the predetermined step size from the addressing hash value, setting n = n + 1 and performing the first comparison again.

3. The method of claim 2, wherein performing the first comparison comprises:

comparing, according to an open addressing scheme, the storage hash value of the current input data simultaneously with all storage hash values in the n-th level memory within the predetermined step size from the addressing hash value.

4. The method of claim 3, wherein, when the n-th level memory is one of the first N-1 levels of the N levels of memory, if the storage hash value of the current input data is different from every storage hash value stored in the n-th level memory within the predetermined step size from the addressing hash value, one of the storage hash values stored in the n-th level memory within the predetermined step size from the addressing hash value is replaced with the storage hash value of the current input data, and the first comparison is performed again for the (n+1)-th level memory.

5. The method of claim 3, wherein, when the n-th level memory is the N-th level memory of the N levels of memory, the current input data is determined not to be duplicate data if the storage hash value of the current input data is different from every storage hash value stored in the N-th level memory within the predetermined step size from the addressing hash value.

6. The method of claim 5, wherein, when it is determined for the N-th level memory of the N levels of memory that the current input data is not duplicate data, the method further comprises:

determining whether an empty entry exists within the predetermined step size in the N-th level memory;

if an empty entry exists within the predetermined step size in the N-th level memory, storing the storage hash value of the current input data in the empty entry; and

if no empty entry exists within the predetermined step size in the N-th level memory, discarding the storage hash value of the current input data.

7. The method of claim 1, wherein the preset pipeline gap is adjustable according to the required data deduplication accuracy and the read/write latency of the memories.

8. An apparatus for data deduplication, the apparatus comprising an FPGA board, wherein the FPGA board comprises an FPGA chip and at least one external memory, the FPGA chip comprises at least one calculation engine, each calculation engine comprises at least one hash calculator, at least one internal memory, and a controller, and each calculation engine is configured to process input data in a pipelined manner according to a preset pipeline gap of the FPGA chip by:

performing, by the hash calculator in each calculation engine, a hash calculation on the current input data to obtain an addressing hash value and a storage hash value of the current input data; and

comparing, by the controller in each calculation engine, the storage hash value of the current input data with all storage hash values stored in an n-th level memory of N levels of memory within a predetermined step size from the addressing hash value, to determine whether the current input data is duplicate data, wherein the first M levels of the N levels of memory belong to the internal memory of each calculation engine, the last N-M levels of the N levels of memory belong to the at least one external memory, N, M, and n are positive integers, N ≥ 2, 1 ≤ M ≤ N-1, and 1 ≤ n ≤ N.

9. The apparatus of claim 8, wherein each calculation engine is configured to determine whether the current input data is duplicate data by performing the following operations starting from the first level memory (n = 1) of the N levels of memory:

10. The apparatus of claim 9, wherein each calculation engine is configured to perform the first comparison by: comparing, according to an open addressing scheme, the storage hash value of the current input data simultaneously with all storage hash values in the n-th level memory of the N levels of memory within the predetermined step size from the addressing hash value.