WO2015061995A1

WO2015061995A1 - Data processing method, device, and duplication processor

Info

Publication number: WO2015061995A1
Application number: PCT/CN2013/086253
Authority: WO
Inventors: 于传帅; 张程伟; 张宗全; 林春恭; 游俊; 刘强
Original assignee: Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date: 2013-10-30
Filing date: 2013-10-30
Publication date: 2015-05-07
Also published as: CN103930890B; CN103930890A

Abstract

Embodiments of the present invention, by indexing data primarily according to index values corresponding to data of a first query unit, where the first query unit comprises multiple minimum data blocks, and by extracting some bits from each minimum data block to compose an index value corresponding to the first query unit, greatly reduce index matching time, increase index matching efficiency, and at the same time, make possible a significant reduction in the amount of memory occupied by indexes.

Description

Data processing method, device and deduplication processor

Embodiments of the present invention relate to storage technologies, and in particular, to a data processing method, apparatus, and deduplication processor.

Background technique

Deduplication, also known as smart compression or single-instance storage, is a storage technology that automatically searches for duplicate data, keeps only a single copy of identical data, and replaces the other copies with pointers to that single copy, thereby eliminating redundant data and reducing storage capacity requirements.

In the prior art, the deduplication method can employ a fixed length blocking algorithm.

A fingerprint algorithm is used to calculate the fingerprint of the part of the data object covered by a sliding window. If a predetermined condition is met, the start position and the end position of the sliding window are used as boundaries of a data block, and the data object is segmented by continuously sliding the window and calculating fingerprints. For each data block obtained by this division, it is first determined whether the data block is longer than a lower length limit. If it is, the fingerprint value of the data block, such as a hash value, is calculated and compared with the fingerprint values stored in the storage device. If the fingerprint value of the data block is the same as a fingerprint value stored in the storage device, the data block is a duplicate: an identical data block has already been stored in the storage device, so the data object can simply reference the stored data block. If the fingerprint value of the data block does not exist in the storage device, the data block and its fingerprint value are stored in the storage device for use in subsequent duplicate-data determination.
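As a rough illustration of the prior-art procedure described above, the following minimal Python sketch slides a window over the data, cuts a block when the window fingerprint meets a predetermined condition and the block has reached a lower length limit, and then deduplicates the blocks by fingerprint. The window size, the boundary mask, the lower limit, the use of SHA-1, and all function names are assumptions chosen for illustration and are not taken from the source; a practical system would typically use a rolling hash rather than rehashing every window position.

```python
import hashlib


def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3F, min_len: int = 64):
    """Yield (start, end) offsets of blocks cut at content-defined boundaries."""
    start = 0
    for end in range(window, len(data) + 1):
        if end - start < min_len:  # block is still below the lower length limit
            continue
        # Fingerprint of the bytes currently covered by the sliding window.
        fp = int.from_bytes(hashlib.sha1(data[end - window:end]).digest()[:4], "big")
        if fp & mask == 0:  # assumed "predetermined condition": cut a block here
            yield start, end
            start = end
    if start < len(data):  # trailing data forms the last block
        yield start, len(data)


def deduplicate(data: bytes, store: dict) -> list:
    """Store unseen blocks in `store`; return the fingerprint list describing `data`."""
    refs = []
    for s, e in chunk_boundaries(data):
        block = data[s:e]
        fp = hashlib.sha1(block).hexdigest()
        if fp not in store:  # new block: keep it together with its fingerprint
            store[fp] = block
        refs.append(fp)  # duplicate blocks are only referenced
    return refs
```

In this sketch every block keeps a full fingerprint in `store`, which is the per-block index cost that the remainder of the document sets out to reduce.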

However, the inventors have found that in the prior-art deduplication method the index occupies a large amount of memory and therefore cannot keep up with storage requirements as the amount of data steadily grows.

Summary of the invention

Embodiments of the present invention provide a data processing method, an apparatus, and a deduplication processor, which reduce the memory occupied by the index and meet the storage demands of ever-growing data volumes.

In a first aspect, an embodiment of the present invention provides a data processing method. The method is applied to a data processing system that includes a deduplication processor, and the method includes: the deduplication processor uses the data covered by a sliding window, within the data that requires duplicate data search, as a first query unit, where the first query unit includes a plurality of minimum data blocks and a minimum data block is a data block of the minimum query unit used for duplicate data search; and index construction and duplicate data search are performed on the data in the first query unit. The index construction includes: extracting some bits from each minimum data block in the first query unit, and composing the extracted bits into an index value of a preset length corresponding to the first query unit.

The duplicate data search includes: querying a preset index table for an index value that is the same as the index value corresponding to the first query unit; and if a first index value that is the same as the index value corresponding to the first query unit is found in the index table, searching whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value.

With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner, in which the method further includes: if the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value, using the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window and the second query unit includes at least one minimum data block; constructing an index value of the preset length from the at least one minimum data block in the second query unit; and querying the index table for a second index value that is the same as the index value corresponding to the second query unit.

If a second index value that is the same as the index value corresponding to the second query unit is found in the index table, it is searched whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention further provides a second possible implementation manner, in which the method further includes:

if no second index value that is the same as the index value corresponding to the second query unit is found in the index table, inserting into the index table the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit.

With reference to the first aspect, an embodiment of the present invention provides a third possible implementation manner: if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the method further includes: determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window; and if not, sliding the sliding window by a preset step, using the data in the sliding window after sliding as the first query unit, and performing the index construction step and the duplicate data search step.

With reference to the first aspect, or the first or second possible implementation manner of the first aspect, an embodiment of the present invention further provides a fourth possible implementation manner: if no first index value that is the same as the index value corresponding to the first query unit is found in the index table, the method further includes:

determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window; and if not, sliding the sliding window by a preset step, using the data in the sliding window after sliding as the first query unit, and performing the index construction step and the duplicate data search step.

With reference to the fourth possible implementation manner of the first aspect, an embodiment of the present invention further provides a fifth possible implementation manner: if no first index value that is the same as the index value corresponding to the first query unit is found in the index table, then before determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit, the method further includes: querying the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than a preset matching degree; if the index table contains no such third index value, permuting the positions of the values within the index value corresponding to the first query unit in turn, and determining whether a third index value identical to the permuted index value is found in the index table; and if none is found, proceeding to the step of determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit.
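The approximate-matching fallback of the fifth possible implementation manner can be sketched as follows. The source does not define how the matching degree is computed or how the positions are permuted, so this sketch makes several loudly labeled assumptions: the index value is treated as eight 5-bit fields (one per minimum data block), the matching degree is taken to be the fraction of fields that agree position by position, and the names `split_fields`, `join_fields`, `matching_degree`, and `find_third_index` are hypothetical.

```python
from itertools import permutations

FIELDS = 8  # assumed number of minimum data blocks per query unit
BITS = 5    # assumed number of bits contributed by each block


def split_fields(index_value: int) -> tuple:
    """Split an index value into its per-block bit fields, most significant first."""
    mask = (1 << BITS) - 1
    return tuple((index_value >> (BITS * (FIELDS - 1 - i))) & mask for i in range(FIELDS))


def join_fields(fields) -> int:
    """Inverse of split_fields."""
    value = 0
    for f in fields:
        value = (value << BITS) | f
    return value


def matching_degree(a: int, b: int) -> float:
    """Assumed metric: fraction of fields that are equal at the same position."""
    same = sum(x == y for x, y in zip(split_fields(a), split_fields(b)))
    return same / FIELDS


def find_third_index(index_value: int, index_table: dict, threshold: float = 0.75):
    """Return a stored index value that approximately matches, or None."""
    # 1) Look for a stored value whose matching degree reaches the threshold.
    for candidate in index_table:
        if matching_degree(index_value, candidate) >= threshold:
            return candidate
    # 2) Otherwise rearrange the field positions and look for an exact hit.
    for perm in permutations(split_fields(index_value)):
        candidate = join_fields(perm)
        if candidate in index_table:
            return candidate
    return None  # caller falls through to the data-size check
```

The linear scan and the exhaustive permutation loop (8! = 40320 candidates) are kept only for clarity; they are not a claim about how an efficient implementation would organize the lookup.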

With reference to the fifth possible implementation manner of the first aspect, an embodiment of the present invention provides a sixth possible implementation manner: if a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree is found in the index table, it is searched whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the third index value.

With reference to the sixth possible implementation manner of the first aspect, an embodiment of the present invention further provides a seventh possible implementation manner, in which the method further includes:

obtaining the duplicate data search result; if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, performing delta compression on the data in the first query unit based on that target data; storing the delta-compressed data; and inserting into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.

In a second aspect, an embodiment of the present invention provides a data processing apparatus, including: an index construction unit, configured to perform index construction, where the index construction includes: using the data covered by a sliding window, within the data that requires duplicate data search, as a first query unit, extracting some bits from the fingerprint value of each minimum data block in the first query unit, and composing the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks and a minimum data block is a data block of the minimum query unit used for duplicate data search; an index matching unit, configured to query a preset index table for an index value that is the same as the index value corresponding to the first query unit; and

a duplicate data search unit, configured to: if the index matching unit finds in the index table a first index value that is the same as the index value corresponding to the first query unit, search whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value.

With reference to the second aspect, an embodiment of the present invention provides a first possible implementation manner: if the duplicate data search unit finds that the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value, the index construction unit is further configured to use the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window, and to construct an index value of the preset length from the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block; the index matching unit is further configured to query the index table for a second index value that is the same as the index value corresponding to the second query unit; and

the duplicate data search unit is further configured to: if the index matching unit finds in the index table a second index value that is the same as the index value corresponding to the second query unit, search whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

With reference to the first possible implementation manner of the second aspect, an embodiment of the present invention provides a second possible implementation manner: if the index matching unit finds in the index table no second index value that is the same as the index value corresponding to the second query unit, the apparatus further includes: a first storage unit, configured to store the data in the second query unit; and

a first index table updating unit, configured to insert into the preset index table the correspondence between the index value corresponding to the data of the second query unit and the storage address of the data in the second query unit.

With reference to the second aspect, an embodiment of the present invention further provides a third possible implementation manner: if the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the apparatus further includes: a first determining unit, configured to determine, when the duplicate data search unit finds that the data in the first query unit does not completely duplicate that target data, whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window; and a first instruction unit, configured to instruct the sliding window to slide by a preset step when the first determining unit determines that the size of the data before the start position of the sliding window does not reach the size of the first query unit;

the index construction unit is further configured to construct the index by using the data in the sliding window after sliding as the first query unit.

With reference to the second aspect, or the first or second possible implementation manner of the second aspect, an embodiment of the present invention provides a fourth possible implementation manner: if the index matching unit finds in the index table no first index value that is the same as the index value corresponding to the first query unit, the apparatus further includes: a second determining unit, configured to determine, when the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window; and a second instruction unit, configured to instruct the sliding window to slide by a preset step when the second determining unit determines that the size of the data before the start position of the sliding window does not reach the size of the first query unit; the index construction unit is further configured to construct the index by using the data in the sliding window after sliding as the first query unit.

With reference to the fourth possible implementation manner of the second aspect, an embodiment of the present invention further provides a fifth possible implementation manner: if the index matching unit finds in the index table no first index value that is the same as the index value corresponding to the first query unit, the index matching unit is further configured to query the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than a preset matching degree, and, if the index table contains no such third index value, to permute the positions of the values within the index value corresponding to the first query unit in turn and determine whether a third index value identical to the permuted index value is found in the index table;

the second determining unit is further configured to: when the index matching unit ultimately fails to match a third index value, determine whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window, and send the result to the second instruction unit.

With reference to the fifth possible implementation manner of the second aspect, an embodiment of the present invention provides a sixth possible implementation manner: the duplicate data search unit is further configured to: if the index matching unit finds in the index table a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree, search whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the third index value.

With reference to the sixth possible implementation manner of the second aspect, an embodiment of the present invention further provides a seventh possible implementation manner, in which the apparatus further includes: a delta compression unit, configured to obtain the duplicate data search result of the duplicate data search unit and, if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, perform delta compression on the data in the first query unit based on that target data; a second storage unit, configured to store the delta-compressed data; and a second index table updating unit, configured to insert into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.

In the embodiments of the present invention, data is indexed mainly by the index value corresponding to the data of the first query unit. The first query unit includes a plurality of minimum data blocks, and the index value corresponding to the first query unit is composed of some bits taken from each minimum data block. This greatly reduces index matching time, improves index matching efficiency, and also greatly reduces the memory occupied by the index.

DRAWINGS

To illustrate the embodiments of the present invention or the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.

FIG. 1-A is a schematic structural diagram of a data processing system according to an embodiment of the present invention;

FIG. 1-B is a schematic diagram of another structure of a data processing system according to an embodiment of the present invention;

FIG. 2 is a flowchart of a data processing method according to an embodiment of the present invention;

FIG. 2-A is a schematic diagram of an index structure provided by an embodiment of the present invention;

FIG. 3 is a flowchart of another data processing method according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a deduplication processor according to an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

An embodiment of the present invention provides a data processing system that includes at least one deduplication processor and at least one storage node. The deduplication processor and the storage node can be deployed in either of two modes. In mode 1, as shown in FIG. 1-A, each deduplication processor is connected to the storage node through a network; the deduplication processor can be deployed as software, as a separate hardware device, or integrated into another hardware device, and is deployed on the user side.

Alternatively, as shown in FIG. 1-B, the deduplication processor may be integrated on the storage node as a hardware device, or deployed on the storage node as a software function module, and processes the data after receiving it from the user. A data processing system 10 provided in an embodiment of the present invention includes at least one deduplication processor 101, 102, ..., 10n and a plurality of storage nodes 111, 112, ..., 11n. Each deduplication processor receives the data sent by the user through an interface, which may be a standard protocol interface such as an NFS protocol interface.

In this embodiment, an index table is preset in the data processing system. The index table contains the correspondence between the index values of data already stored in the storage node and the storage addresses of that data. The index table can be stored in each deduplication processor or in a storage node.

Each deduplication processor in the data processing system is connected to each storage node, for example through a network connection or another type of connection.

FIG. 2 is a flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 2, the method of this embodiment is applied to a data processing system and includes:

Step 201: The deduplication processor uses the data covered by the sliding window, within the data that requires duplicate data search, as a rated-size data block. The rated-size data block includes a plurality of minimum data blocks, where a minimum data block is a data block of the minimum query unit used for duplicate data search. The rated-size data block in the sliding window is used as a first query unit.

In actual operation, the minimum data block is the data block of the minimum query unit in duplicate data search and is usually 4 KB; with variable-length chunking, the minimum query unit is about 4 KB. The embodiment of the present invention does not limit the size of the data block of the minimum query unit. For convenience of description, the data in the sliding window is referred to as the first query unit.

Step 202: Perform index construction, including: extracting some bits from the fingerprint value of each minimum data block in the first query unit, and composing the extracted bits into an index value of a preset length corresponding to the first query unit.

Specifically, when the index is constructed, the fingerprint value of each minimum data block in the first query unit is obtained, the same preset number of bits is extracted from the fingerprint value corresponding to each minimum data block, and all the extracted bits are combined into the index value corresponding to the first query unit. As shown in FIG. 2-A, the rated-size data block is 32 KB and includes eight 4 KB minimum data blocks; five bits are taken from the fingerprint value corresponding to each minimum data block, and the extracted bits are combined into a 40-bit index value.
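The index construction of step 202 can be sketched with the figures from the example above: a 32 KB query unit, eight 4 KB minimum data blocks, and five bits per block. The choice of SHA-1 as the fingerprint algorithm, the decision to take the five lowest-order bits, and the function name `build_index_value` are assumptions made for illustration; the source does not specify which fingerprint algorithm or which bits are used.

```python
import hashlib

MIN_BLOCK = 4 * 1024   # minimum data block size from the example
BLOCKS_PER_UNIT = 8    # minimum data blocks in one rated-size data block
BITS_PER_BLOCK = 5     # bits taken from each block's fingerprint


def build_index_value(unit: bytes) -> int:
    """Compose a 40-bit index value from the fingerprints of eight 4 KB blocks."""
    if len(unit) != MIN_BLOCK * BLOCKS_PER_UNIT:
        raise ValueError("query unit must be exactly 32 KB in this sketch")
    index_value = 0
    for i in range(BLOCKS_PER_UNIT):
        block = unit[i * MIN_BLOCK:(i + 1) * MIN_BLOCK]
        # Fingerprint of the minimum data block (SHA-1 is an assumed choice).
        fingerprint = int.from_bytes(hashlib.sha1(block).digest(), "big")
        # Keep the 5 lowest-order bits (which bits to keep is an assumption).
        bits = fingerprint & ((1 << BITS_PER_BLOCK) - 1)
        # Append the bits so that block 0 ends up in the most significant field.
        index_value = (index_value << BITS_PER_BLOCK) | bits
    return index_value
```

Because each block contributes its bits to a fixed field, two query units that share the same blocks in the same order always produce the same 40-bit value, while the converse does not hold; this is why a matching index value is followed by the data comparison of step 204.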

The embodiment of the present invention does not limit whether the same number of bits is extracted from each minimum data block, or how many extracted bits are used to form the index. This only needs to be determined flexibly according to the set index length, and the index length is set according to the actual situation, which is likewise not limited in the embodiment of the present invention.

Step 203: Query the preset index table for an index value that is the same as the index value corresponding to the first query unit; if a first index value that is the same as the index value corresponding to the first query unit is found in the index table, go to step 204.

For convenience of the subsequent description, the index value in the index table that is the same as the index value corresponding to the first query unit is referred to as the first index value.

Depending on where the index table is stored, the specific steps that the deduplication processor performs for the index query in step 203 differ:

When the preset index table is stored in the deduplication processor, the deduplication processor may query its local index table for a first index value that is the same as the index value corresponding to the first query unit, and obtain the query result.

When the preset index table is stored in the storage node, the deduplication processor may send the index value corresponding to the first query unit to the storage node, the storage node queries the index table for a first index value that is the same as the index value corresponding to the first query unit, and the deduplication processor receives the query result fed back by the storage node.
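The two lookup paths of step 203 can be sketched as follows. The classes `StorageNode` and `DedupProcessor`, the method names, and the representation of the index table as a dictionary from index value to storage address are all hypothetical; the method call on the storage node merely stands in for whatever network request a real deployment would use.

```python
from typing import Dict, Optional


class StorageNode:
    """Hypothetical storage node that may host the index table."""

    def __init__(self) -> None:
        self.index_table: Dict[int, str] = {}

    def query_index(self, index_value: int) -> Optional[str]:
        # Stands in for a network request handled on the storage node.
        return self.index_table.get(index_value)


class DedupProcessor:
    """Hypothetical deduplication processor with an optional local index table."""

    def __init__(self, node: StorageNode,
                 local_table: Optional[Dict[int, str]] = None) -> None:
        self.node = node
        self.local_table = local_table

    def find_first_index(self, index_value: int) -> Optional[str]:
        """Step 203: return the storage address for a matching index value, if any."""
        if self.local_table is not None:
            # Index table kept in the deduplication processor: query it locally.
            return self.local_table.get(index_value)
        # Index table kept on the storage node: send the index value and
        # use the result that the node feeds back.
        return self.node.query_index(index_value)
```

A return value of `None` corresponds to the case in which no first index value is found.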

Step 204: Search whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value.

The search in step 204 may be performed in any of the following ways: loading the target data pointed to by the data storage address corresponding to the first index value into the deduplication processor and comparing it with the data included in the first query unit to find duplicate data; sending the data included in the first query unit to the storage node where the target data is located and comparing it with the target data there to find duplicate data; or comparing the fingerprint value corresponding to the target data pointed to by the data storage address corresponding to the first index value with the fingerprint value corresponding to the data included in the first query unit to find duplicate data. The specific manner is not limited in the embodiment of the present invention.
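The fingerprint-comparison variant of step 204 can be sketched as below. The source leaves the comparison method open, so this is only one possible reading: it assumes that the query unit and the target data are compared block by block at the same position, with SHA-1 as the fingerprint, and the function names are hypothetical.

```python
import hashlib


def block_fingerprints(data: bytes, block_size: int = 4096) -> list:
    """Fingerprint each minimum data block (SHA-1 is an assumed choice)."""
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def find_duplicate_blocks(unit: bytes, target: bytes, block_size: int = 4096) -> list:
    """For each block of `unit`, report whether the block at the same
    position in `target` has the same fingerprint."""
    unit_fps = block_fingerprints(unit, block_size)
    target_fps = block_fingerprints(target, block_size)
    return [i < len(target_fps) and fp == target_fps[i]
            for i, fp in enumerate(unit_fps)]


def is_complete_duplicate(unit: bytes, target: bytes, block_size: int = 4096) -> bool:
    """True only when both have equal length and every block fingerprint matches."""
    return len(unit) == len(target) and all(find_duplicate_blocks(unit, target, block_size))
```

The per-block result distinguishes partially duplicated units from completely duplicated ones, which matters for the later decision between deduplicating the unit and sliding the window.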

In the embodiment of the present invention, the rated-size data block includes a plurality of minimum data blocks, and a single index value is constructed from these minimum data blocks, which greatly reduces the number of index entries.
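A back-of-the-envelope calculation shows the scale of this reduction for the example sizes used in this embodiment (4 KB minimum data blocks, 32 KB rated-size data blocks, 40-bit index values). The 1 TiB data volume and the 20-byte per-block fingerprint used as the baseline key are assumptions for illustration, and the space needed for storage addresses is ignored on both sides.

```python
def index_entries(total_bytes: int, unit_bytes: int) -> int:
    """Number of index entries when one entry covers `unit_bytes` of data."""
    return total_bytes // unit_bytes


TIB = 1 << 40  # assumed data volume: 1 TiB of unique data

per_block_entries = index_entries(TIB, 4 * 1024)   # one entry per 4 KB minimum block
per_unit_entries = index_entries(TIB, 32 * 1024)   # one entry per 32 KB query unit

per_block_key_bytes = per_block_entries * 20  # assumed 20-byte fingerprint per entry
per_unit_key_bytes = per_unit_entries * 5     # 40-bit (5-byte) index value per entry

print(per_block_entries, per_unit_entries)      # 268435456 33554432
print(per_block_key_bytes, per_unit_key_bytes)  # 5368709120 167772160
```

Under these assumptions the number of entries drops by a factor of 8, and the key material shrinks from 5 GiB to 160 MiB.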

To reduce the impact of offsets of the minimum data blocks, in this method embodiment, if no first index value that is the same as the index value corresponding to the first query unit is found in the index table, go to step 205.

Step 205: Determine whether the size of the data before the start position of the sliding window reaches the size of the rated-size data block, where "before" refers to the direction opposite to the sliding direction of the sliding window. If not, go to step 206; if so, go to step 207.

In this method embodiment, after each duplicate data search, if part of the data to be queried duplicates already stored data, that part of the data to be queried needs to be deleted and the reference count of the already stored data increased; if part of the data to be queried is determined to be new data, that part is stored. Because part of the data to be queried has been deleted or stored, a data breakpoint appears in the data to be queried. In the description of this embodiment, "before the start position of the sliding window" refers to the direction opposite to the sliding direction of the sliding window.

Step 206: Slide the sliding window by a preset step, use the data in the sliding window after sliding as the first query unit, and return to step 202.

Step 207: Store the data before the start position of the sliding window as new data, where "before" refers to the direction opposite to the sliding direction of the sliding window.

No identical index is found in the index table for the rated-size data block in the sliding window, that is, no duplicate data is found for the data in the sliding window. To allow for the possibility that the stored data differs from the data in the sliding window only by an offset of a certain length, the embodiment of the present invention slides the sliding window by one step and searches again. When the size of the data before the sliding window reaches the rated-size data block, the sliding window is known to have slid by the length of one rated-size data block, and data of this length is exactly the data covered by an earlier sliding window. Because the index corresponding to that data was already found to have no identical index in the index table when the earlier sliding window covered it, the data before the sliding window can now be stored directly as new data, where "before the sliding window" refers to the direction opposite to the sliding direction.

Step 208: Insert the correspondence between the index value corresponding to the new data and the storage address of the new data into the index table.

As mentioned above, the new data is the data covered by an earlier sliding window, and the index value corresponding to that data was already calculated when the sliding window covered it. If these index values were saved, the index value of the new data can be obtained directly; if not, the index value corresponding to the new data may be obtained by the foregoing method for obtaining the index value corresponding to the first query unit. The index table may be empty at the beginning and is continuously updated during subsequent duplicate data searches by inserting correspondences between the index values of new data and the storage addresses of the new data, where new data is data found to be non-duplicate.
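Steps 205 to 208, the path taken when no matching index value is found, can be sketched as a single function that is called once per unmatched window position. The `WindowState` class, the use of a list as the storage back end (with the list position serving as the storage address), and the `index_fn` callback that computes an index value are hypothetical names introduced for this sketch. What happens immediately after step 208 is not spelled out in this passage, so the function simply returns and leaves the next action to the caller.

```python
from dataclasses import dataclass


@dataclass
class WindowState:
    pending_start: int = 0  # first byte not yet stored or deduplicated (the breakpoint)
    window_start: int = 0   # start position of the sliding window


def handle_no_match(data: bytes, state: WindowState, index_table: dict,
                    storage: list, unit_size: int, step: int, index_fn) -> None:
    """Apply steps 205-208 for one window position that had no index match."""
    # Step 205: has the data before the window reached the rated size?
    if state.window_start - state.pending_start >= unit_size:
        # Step 207: store the data before the window start as new data.
        new_data = data[state.pending_start:state.window_start]
        address = len(storage)  # list position used as a stand-in address
        storage.append(new_data)
        # Step 208: record index value -> storage address for the new data.
        index_table[index_fn(new_data)] = address
        state.pending_start = state.window_start
    else:
        # Step 206: slide the window by the preset step; the caller then
        # rebuilds the index value for the new window (back to step 202).
        state.window_start += step
```

With a 32 KB unit and a 4 KB step, eight consecutive misses move the window forward by exactly one rated-size data block, and the ninth call stores the 32 KB that the window originally covered.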

In the embodiment of the present invention, an index is constructed according to a plurality of minimum data blocks in the sliding window. Therefore, when the data in the sliding window, that is, the data in the first query unit, matches the same first index value in the index table. When the data pointed to by the data address corresponding to the first index value is compared with the data in the first query unit, it is determined whether the data in the first query unit is duplicated with the already stored data. Before the comparison of the data, it is possible to judge whether the data is repeated by comparing the fingerprint values corresponding to the data in the prior art. In the embodiment of the present invention, when the fingerprint value of the data in the first query unit and the fingerprint value corresponding to the target data cannot be completely overlapped, the reason for avoiding is that the stored data may only be related to the data in the sliding window. A certain offset, and therefore, when it is not completely repetitive, the embodiment of the present invention slides the sliding window by one step and then performs the search. Therefore, after the step 204 in the embodiment of the present invention, the method may further include:

Step 209: Determine whether the fingerprint value of the data in the first query unit is exactly the same as the fingerprint value of the target data; if yes, proceed to step 210; if not, proceed to step 205;

When the data in the sliding window does not completely duplicate the target data, the stored data may merely be offset by some length from the data in the sliding window, so the embodiment of the present invention slides the sliding window by one step and searches again. When the size of the data before the start position of the sliding window reaches the size of the rated-size data block, the sliding window has slid by the length of one rated-size data block. The data of this length was covered by a previous position of the sliding window and has already been judged not to completely duplicate the target data, so the data before the start position of the current sliding window can be stored directly as new data. Therefore, when it is determined that the fingerprint value of the data in the first query unit is not exactly the same as the fingerprint value of the target data, the process may proceed directly to step 205;

Step 210: Deduplicate the data in the first query unit.

When the fingerprint value of the data in the first query unit is exactly the same as the fingerprint value of the target data, the data in the first query unit is determined to be duplicate data; for the specific method of deleting the duplicate data, refer to the prior art.

It should be noted that when the data in the first query unit is determined to be duplicate data and is deleted, the data before the sliding window needs to be processed regardless of whether its size reaches the size of the rated-size data block. Therefore, after step 209, the embodiment of the present invention further includes:

Step 211: Use the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to that in which the sliding window slides, and the second query unit includes at least one minimum data block; construct, from the at least one minimum data block in the second query unit, an index value having the same length as the index value corresponding to the first query unit; and query the index table for a second index value that is the same as the index value corresponding to the second query unit. If one is found, proceed to step 212; if not, proceed to step 213;

The second query unit may include only one minimum data block, for example, 4 KB of data. In this case, an index value having the same length as the index value corresponding to the first query unit still needs to be constructed from that minimum data block;

If the second query unit includes a plurality of minimum data blocks, an index value having the same length as the index value corresponding to the first query unit needs to be constructed from all the minimum data blocks included. For example, if the index value needs to be 40 bits long and there are two minimum data blocks, 20 bits need to be taken from the fingerprint value of each minimum data block to obtain the 40-bit index value;

Step 212: Find whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value. The duplicate data in step 212 may be found by loading the target data pointed to by the data storage address corresponding to the second index value into the deduplication processor and comparing it with the data included in the second query unit; or by sending the data included in the second query unit to the data storage address corresponding to the second index value to query the duplicate data; or by comparing the fingerprint value corresponding to the target data pointed to by the data storage address corresponding to the first index value with the fingerprint value corresponding to the data included in the first query unit.

Step 213: Store the data in the second query unit as new data, and insert the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit into the index table.
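The 40-bit example above (two minimum data blocks contributing 20 bits each) can be sketched as follows. This is a minimal illustration under stated assumptions: SHA-1 is used as the fingerprint function, the leading bits of each fingerprint are the ones extracted, and the function names are hypothetical. The embodiment itself does not fix which hash is used or which bits are taken.

```python
import hashlib
from typing import Sequence

FP_BITS = 160  # SHA-1 digest length in bits (assumed fingerprint function)


def fingerprint(block: bytes) -> int:
    """Fingerprint of one minimum data block as an integer."""
    return int.from_bytes(hashlib.sha1(block).digest(), "big")


def build_index_value(blocks: Sequence[bytes], index_bits: int = 40) -> int:
    """Compose a fixed-length index value from the blocks of a query unit.

    Each block contributes index_bits / len(blocks) bits, taken here from
    the top of its fingerprint (an assumption made for this sketch).
    """
    if not blocks or index_bits % len(blocks) != 0:
        raise ValueError("index_bits must divide evenly among the blocks")
    bits_per_block = index_bits // len(blocks)
    index_value = 0
    for block in blocks:
        top_bits = fingerprint(block) >> (FP_BITS - bits_per_block)
        index_value = (index_value << bits_per_block) | top_bits
    return index_value


block_a = b"a" * 4096   # two 4 KB minimum data blocks
block_b = b"b" * 4096
two_block_index = build_index_value([block_a, block_b])   # 20 + 20 bits
one_block_index = build_index_value([block_a])            # 40 bits from one block
```

Both results have the same 40-bit length, so index values built from a single minimum data block and from several blocks can share one index table.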

In the embodiment of the present invention, an index value is constructed for a plurality of minimum data blocks, and operations such as reference counting and compression are performed on the data in units of rated-size data blocks, which greatly reduces memory occupation. In addition, to counter the reduction in deduplication rate caused by data offset during the duplicate data search, duplicate data is searched for using a mix of rated-size data blocks and data blocks smaller than the rated size, which improves the deduplication rate.

Referring to FIG. 3, an embodiment of the present invention further provides another data processing method. For the parts of the flow that are the same as in the method embodiment corresponding to FIG. 2, refer to the description of that embodiment. The difference from the method embodiment corresponding to FIG. 2 is that when the same index value is not found in the index table, the index value is transformed and the search continues, in order to improve the probability of finding the same index. The data processing method described in FIG. 3 includes:

Step 301: The deduplication processor uses the data in the sliding window, within the data on which a duplicate data query needs to be performed, as a nominal-size data block, where the nominal-size data block includes a plurality of minimum data blocks, and a minimum data block is a data block of the minimum query unit used for the duplicate data search; the data in the sliding window is used as a first query unit;

Step 302: Perform index construction, including: constructing, from the plurality of minimum data blocks in the first query unit, an index value of a preset length corresponding to the first query unit;

Step 303: Query the index table for a first index value that is the same as the index value corresponding to the first query unit; if such a first index value is found in the index table, proceed to step 308; if not, proceed to step 304;

Step 304: Query the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than a preset matching degree; if one is found, proceed to step 308; if not, proceed to step 306;

It should be noted that obtaining the third index value and obtaining the first index value may be completed in one step in actual operation; the embodiment of the present invention describes them as two logical steps for clarity;

Step 306: Determine whether one permutation and combination cycle has been exceeded; if not, proceed to step 307; if so, proceed to step 309;

Step 307: Permute the positional order of the data in the index value corresponding to the first query unit, and return to step 303.

In the embodiment of the present invention, when no identical index value is found in the index table for the index value corresponding to the first query unit, the positional order of the data in that index value is changed in order to improve the probability of finding an identical index value. The order can be changed in various ways; the embodiment of the present invention uses permutation and combination. Specifically, the index value corresponding to the first query unit may be divided into multiple parts, and the positions of these parts within the index value are permuted a first time. It is then determined whether a third index value whose matching degree is equal to or higher than the preset matching degree is found in the index table for the permuted index value. If not, the positions of the parts are permuted a second time, and it is again determined whether a matching third index value is found in the index table for the permuted index value;

The permutation stops when a matching third index value is found in the index table for the permuted index value corresponding to the first query unit, or when all possible permutations have been tried.

Step 308: Find whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value;

Step 309: Determine whether the size of the data before the start position of the sliding window, where "before" refers to the direction opposite to that in which the sliding window slides, reaches the size of the nominal-size data block; if not, proceed to step 310; if yes, proceed to step 311;

Step 310: Slide the sliding window by a preset step size, use the data in the sliding window after sliding as the first query unit, and return to step 302;

Step 311: Store the data before the start position of the sliding window as new data.

Step 312: Insert the correspondence between the index value corresponding to the new data and the storage address of the new data into the index table.
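The lookup in steps 303 to 307 can be sketched as below. Several details are assumptions made only for this illustration: the 40-bit index value is divided into four 10-bit parts, the matching degree is taken to be the fraction of identical bit positions, the threshold of 0.7 is illustrative, and the index table is scanned linearly. The embodiment does not prescribe any of these choices, and all function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, Optional, Sequence, Tuple

INDEX_BITS = 40
PARTS = 4                          # assumed number of parts
PART_BITS = INDEX_BITS // PARTS    # 10 bits per part


def split_parts(index_value: int) -> Tuple[int, ...]:
    """Divide an index value into PARTS equal parts, most significant first."""
    mask = (1 << PART_BITS) - 1
    return tuple(
        (index_value >> (PART_BITS * (PARTS - 1 - i))) & mask for i in range(PARTS)
    )


def join_parts(parts: Sequence[int]) -> int:
    """Reassemble an index value from its parts."""
    value = 0
    for part in parts:
        value = (value << PART_BITS) | part
    return value


def matching_degree(a: int, b: int) -> float:
    """Fraction of bit positions in which two index values agree (assumed metric)."""
    differing = bin(a ^ b).count("1")
    return 1.0 - differing / INDEX_BITS


def find_similar(table: Dict[int, int], index_value: int, threshold: float) -> Optional[int]:
    """Return a stored index value whose matching degree reaches the threshold."""
    for candidate in table:        # naive linear scan, for illustration only
        if matching_degree(candidate, index_value) >= threshold:
            return candidate
    return None


def search_with_permutations(
    table: Dict[int, int], index_value: int, threshold: float = 0.7
) -> Optional[int]:
    """Try the original part order first, then every other order, until a match."""
    for order in permutations(split_parts(index_value)):
        hit = find_similar(table, join_parts(order), threshold)
        if hit is not None:
            return hit
    return None                    # one full permutation cycle without a match


stored = join_parts((1, 2, 3, 4))
index_table = {stored: 100}        # index value -> storage address
query = join_parts((4, 3, 2, 1))   # same parts, different order
# With threshold 1.0 only an exact match counts, so the hit requires a permutation.
found = search_with_permutations(index_table, query, threshold=1.0)
```

Because `permutations` yields the original order first, the unpermuted lookup of steps 303 and 304 is covered by the first loop iteration.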

In the embodiment corresponding to FIG. 3, when searching the index table, it is not required to find an identical index; it is only required that the matching degree of the index value be higher than or equal to a preset matching degree. For example, with a preset matching degree of 70%, an index value is accepted if 70% of its data is identical and only 30% differs. An index value considered similar to the index value of the first query unit is thus selected from the index table, and the target data at the data address corresponding to that similar index value in the index table is compared with the data in the first query unit to query the duplicate data. Therefore, after step 308, the method may further include:

Step 313: Determine whether the data in the first query unit and the data pointed to by the data address corresponding to the queried index value are completely duplicated; if not, proceed to step 314;

Whether the data is completely duplicated may be judged by comparing the fingerprint values.

Step 314: Perform delta compression on the data in the first query unit relative to the data pointed to by the data address corresponding to the queried index value;

The delta compression algorithm may be, for example, zdelta, vcdiff, or xdelta, which is not limited in the embodiment of the present invention;

Step 315: Store data obtained by performing delta compression, and insert a correspondence between an index value corresponding to the first query unit and a storage address of the data obtained by the delta compression into the index table.
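To illustrate the idea of step 314, the sketch below compresses a block relative to a similar stored block. It does not use zdelta, vcdiff, or xdelta; as a stand-in that is available in the Python standard library, it uses zlib with the stored block supplied as a preset dictionary (`zdict`), which lets the compressor encode the new block mostly as references into the stored one. The function names and the test data are assumptions made for this example.

```python
import hashlib
import zlib


def delta_compress(data: bytes, target: bytes) -> bytes:
    """Compress `data` relative to `target` (stand-in for a real delta codec)."""
    compressor = zlib.compressobj(zdict=target)
    return compressor.compress(data) + compressor.flush()


def delta_decompress(delta: bytes, target: bytes) -> bytes:
    """Recover the original data; the same `target` must be supplied."""
    decompressor = zlib.decompressobj(zdict=target)
    return decompressor.decompress(delta) + decompressor.flush()


def pseudo_random_bytes(n: int, seed: bytes) -> bytes:
    """Deterministic, poorly compressible bytes for the demonstration."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]


target = pseudo_random_bytes(8192, b"stored block")        # already stored data
data = target[:4000] + b"CHANGED" + target[4007:]          # similar, not identical
delta = delta_compress(data, target)
restored = delta_decompress(delta, target)
```

Only `delta` needs to be stored for the new block; the index table entry would then point at the stored delta, as in step 315.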

In the embodiment of the present invention, if the data in the first query unit and the data pointed to by the data address corresponding to the queried index value are completely duplicated, the method may further include:

Step 316: Perform data deletion on the data in the first query unit.

Step 317: Use the data before the start position of the sliding window as a second query unit; construct, from the at least one minimum data block in the second query unit, an index value having the same length as the index value corresponding to the first query unit; and query the index table for a second index value that is the same as the index value corresponding to the second query unit. If one is found, go to step 318; if not, go to step 319;

Step 318: Find whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value. The duplicate data in step 318 may be found by loading the target data pointed to by the data storage address corresponding to the second index value into the deduplication processor and comparing it with the data included in the second query unit; or by sending the data included in the second query unit to the data storage address corresponding to the second index value to query the duplicate data; or by comparing the fingerprint value corresponding to the target data pointed to by the data storage address corresponding to the first index value with the fingerprint value corresponding to the data included in the first query unit. After the duplicate data query in step 318 is completed, if the data is completely duplicated, the data of the second query unit is deduplicated. If the data is not completely duplicated, delta compression may be performed, or the data of the second query unit may be stored directly as new data, which is not limited in the present invention. The embodiment of the present invention takes storing the data as new data when it is not completely duplicated as an example.

Step 319: Store the data in the second query unit as new data, and insert the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit into the index table.

The data processing method provided by the embodiment of the present invention performs the index search on rated-size data blocks that each include a plurality of minimum query units, which greatly reduces memory occupation. Further, by sliding the window by the minimum query unit and then performing the judgment, the embodiment of the present invention avoids the reduction in deduplication rate caused by offsets during the data search, realizes a mixed-granularity duplicate data search, and improves the deduplication rate while reducing memory occupation.

The embodiment of the present invention further provides a data processing apparatus for performing the method provided in the foregoing embodiments; its implementation principle and technical effects are similar to those of the method. In a specific implementation, the data processing apparatus may be a deduplication processor, or any device that performs the same function, such as a storage node on which a deduplication processor is installed. The apparatus is applied to a data processing system that includes the data processing apparatus and a storage node, and the data processing apparatus communicates with the storage node;

Referring to FIG. 4, an embodiment of the present invention provides a structure of a data processing apparatus, including:

an index construction unit 401, configured to perform index construction, where the index construction includes: using the data covered by the sliding window, within the data on which a duplicate data query needs to be performed, as a first query unit; extracting some bits from the fingerprint value of each minimum data block in the first query unit; and composing the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is a data block of the minimum query unit used for the duplicate data search;

an index matching unit 402, configured to query a preset index table for an index value that is the same as the index value corresponding to the first query unit, to obtain a matching result;

The duplicate data searching unit 403 is configured to: if the index matching unit finds in the index table a first index value that is the same as the index value corresponding to the first query unit, find whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value.

In the embodiment of the present invention, the first query unit includes a plurality of minimum data blocks, and a part is taken from the fingerprint value of each minimum data block to construct an index of a preset length, which greatly reduces the number of indexes in memory.

If the duplicate data searching unit 403 finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are completely duplicated, the index construction unit 401 is further configured to use the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to that in which the sliding window slides, and to construct an index value of the preset length from the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block; the index matching unit 402 is further configured to query the index table for a second index value that is the same as the index value corresponding to the second query unit;

The duplicate data searching unit 403 is further configured to: if the index matching unit 402 finds in the index table a second index value that is the same as the index value corresponding to the second query unit, find whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

If the index matching unit 402 does not find in the index table a second index value that is the same as the index value corresponding to the second query unit, the data processing apparatus may further include: a first storage unit 404, configured to store the data in the second query unit;

The first index table updating unit 405 is configured to insert a correspondence between an index value corresponding to the data of the second query unit and a storage address of the data in the second query unit into the preset index table.

In the embodiment of the present invention, the index table includes both index values corresponding to first query units and index values corresponding to second query units. Because the sizes of the data blocks included in the first query unit and the second query unit differ, the index table holds a mix of index values corresponding to data blocks of several sizes, which reduces memory occupation and improves the duplicate data search rate.

If the duplicate data searching unit 403 finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, the apparatus may further include:

The first determining unit 406 is configured to: when the duplicate data searching unit 403 finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, determine whether the size of the data before the start position of the sliding window, where "before" refers to the direction opposite to that in which the sliding window slides, reaches the size of the first query unit;

a first instruction unit 407, configured to: when the first determining unit 406 determines that the size of the data before the start position of the sliding window does not reach the size of the first query unit, instruct the sliding window to slide by a preset step size;

The index construction unit 401 is further configured to use the data in the sliding window after sliding as a first query unit and construct an index.

If the index matching unit 402 does not find in the index table a first index value that is the same as the index value corresponding to the first query unit, the apparatus may further include:

a second determining unit 408, configured to: when the duplicate data searching unit 403 finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, determine whether the size of the data before the start position of the sliding window, where "before" refers to the direction opposite to that in which the sliding window slides, reaches the size of the first query unit;

a second instruction unit 409, configured to: when the second determining unit 408 determines that the size of the data before the start position of the sliding window does not reach the size of the first query unit, instruct the sliding window to slide by a preset step size;

After the second determining unit 408 determines that the size of the data before the start position of the sliding window reaches the size of the first query unit, the sliding window has slid by the size of one first query unit. The data it has swept over was also covered by the sliding window, and no identical index was found for it in the index table, so the data before the window is stored directly as new data. Therefore, the data processing apparatus may further include: a second storage unit 410, configured to: if the second determining unit 408 determines that the size of the data before the start position of the sliding window reaches the size of the first query unit, store the data before the start position of the sliding window as new data; the correspondence between the index value corresponding to the new data and its storage address is inserted into the index table.

In the embodiment of the present invention, an index value is constructed for a plurality of minimum data blocks, and operations such as reference counting and compression are performed on the data in units of rated-size data blocks, which greatly reduces memory occupation. In addition, to counter the reduction in deduplication rate caused by data offset during the duplicate data search, duplicate data is searched for using a mix of rated-size data blocks and data blocks smaller than the rated size, which improves the deduplication rate.

Referring to FIG. 5, an embodiment of the present invention further provides a data processing apparatus, which is an optimized solution based on the structure of the apparatus provided in FIG. 4. The difference from the apparatus in the embodiment corresponding to FIG. 4 is that if no first index value identical to the index value corresponding to the first query unit is found in the index table, then, before determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit, the positional order of the data in the index value corresponding to the first query unit is permuted, and the permuted index values are matched against the index table to improve the matching rate.
The data processing apparatus provided by the embodiment of the present invention includes:

an index construction unit 501, configured to perform index construction, where the index construction includes: using the data covered by the sliding window, within the data on which a duplicate data query needs to be performed, as a first query unit; extracting some bits from the fingerprint value of each minimum data block in the first query unit; and composing the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is a data block of the minimum query unit used for the duplicate data search;

an index matching unit 502, configured to query a preset index table for an index value that is the same as the index value corresponding to the first query unit, to obtain a matching result; and

a duplicate data searching unit 503, configured to: if the index matching unit finds in the index table a first index value that is the same as the index value corresponding to the first query unit, find whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value.

If the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are completely duplicated, the index construction unit 501 is further configured to use the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to that in which the sliding window slides, and to construct an index value of the preset length from the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block; the index matching unit 502 is further configured to query the index table for a second index value that is the same as the index value corresponding to the second query unit; and the duplicate data searching unit 503 is further configured to: if the index matching unit 502 finds in the index table a second index value that is the same as the index value corresponding to the second query unit, find whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

If the index matching unit 502 does not find in the index table a second index value that is the same as the index value corresponding to the second query unit, the data processing apparatus may further include: a first storage unit 504, configured to store the data in the second query unit; and a first index table update unit 505, configured to insert the correspondence between the index value corresponding to the data of the second query unit and the storage address of the data in the second query unit into the preset index table.

If the index matching unit 502 does not find a first index value that is the same as the index value corresponding to the first query unit, the index matching unit 502 may be further configured to: query the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree; and, if no such third index value is found in the index table, permute the positional order of the data in the index value corresponding to the first query unit and determine whether a matching third index value is found in the index table for the permuted index value.

Correspondingly, the data processing apparatus may further include a second determining unit 508 and a second instruction unit 509. The second determining unit 508 is configured to: when the index matching unit finally fails to match a third index value, determine whether the size of the data before the start position of the sliding window, where "before" refers to the direction opposite to that in which the sliding window slides, reaches the size of the first query unit, and send the result to the second instruction unit 509. The second instruction unit 509 is configured to: when the second determining unit 508 determines that the size of the data before the start position of the sliding window does not reach the size of the first query unit, instruct the sliding window to slide by a preset step size. The index construction unit 501 is further configured to use the data in the sliding window after sliding as a first query unit and construct an index.

The data processing apparatus may further include: a second storage unit 510, configured to: if the second determining unit 508 determines that the size of the data before the start position of the sliding window reaches the size of the first query unit, store the data before the start position of the sliding window as new data; the correspondence between the index value corresponding to the new data and its storage address is inserted into the index table.

The duplicate data searching unit 503 is further configured to: if the index matching unit 502 finds in the index table a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree, find whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the third index value.

If the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, the data processing apparatus may further include:

a delta compression unit 506, configured to obtain the data search result of the duplicate data searching unit 503 and, if the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, perform delta compression on the data in the first query unit relative to the target data pointed to by the data storage address corresponding to the first index value; the second storage unit 510 is further configured to store the data obtained by the delta compression; and a second index table updating unit 511 is further configured to insert the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data into the index table.

The data processing apparatus provided by the embodiment of the present invention performs the index search on rated-size data blocks that each include a plurality of minimum query units, which greatly reduces memory occupation. Further, by sliding the window by the minimum query unit and then performing the judgment, the embodiment of the present invention avoids the reduction in deduplication rate caused by offsets during the data search, realizes a mixed-granularity duplicate data search, and improves the deduplication rate while reducing memory occupation. In addition, when index matching is performed, permuting the data in the index value corresponding to the first query unit improves the duplicate data search rate.

Referring to FIG. 6, an embodiment of the present invention further provides a deduplication processor 600, including a processor 61, a memory 62, a communication interface 63, and a communication bus 64.

The processor 61, the communication interface 63, and the memory 62 communicate with each other through the communication bus 64; the communication interface is configured to receive and transmit data;

The memory 62 is for storing a program; the memory 62 may include a high speed RAM memory, and may also include a non-volatile memory such as at least one disk memory;

The processor 61 is configured to execute the program in the memory to execute a data processing method as provided by the foregoing method embodiments.

If the functions are implemented in the form of a software functional unit and sold or used as a standalone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The foregoing storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disc, and other media that can store program code.

The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data processing method, characterized in that the method is applied to a data processing system, the data processing system includes a deduplication processor, and the method includes: the deduplication processor uses, as a first query unit, the data covered by a sliding window among the data on which duplicate data search is to be performed, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is a data block of the minimum query unit for duplicate data search; and index construction and duplicate data search are performed on the data in the first query unit; the index construction includes: extracting some bits from the fingerprint value of each minimum data block in the first query unit, and using the extracted bits to form an index value of a preset length corresponding to the first query unit; the duplicate data search includes: querying a preset index table for an index value that is the same as the index value corresponding to the first query unit, and, if a first index value that is the same as the index value corresponding to the first query unit is found in the index table, searching the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.

2. The method according to claim 1, characterized in that the method further includes: if the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value: using the data before the starting position of the sliding window as a second query unit, where "before" is relative to the direction opposite to the sliding direction of the sliding window, and the second query unit includes at least one minimum data block; constructing an index value of the preset length from the at least one minimum data block in the second query unit; querying the index table for a second index value that is the same as the index value corresponding to the second query unit; and, if the second index value is found, searching the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

3. The method according to claim 2, further comprising: if the second index value that is the same as the index value corresponding to the second query unit is not found in the index table, storing the data in the second query unit, and inserting into the index table the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit.

4. The method according to claim 1, characterized in that, if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the method further includes: determining whether the size of the data before the starting position of the sliding window reaches the size of the first query unit, where "before" is relative to the direction opposite to the sliding direction of the sliding window; and, if not, sliding the sliding window by a preset step size, using the data in the sliding window after sliding as a first query unit, and performing the index construction step and the duplicate data search step.

5. The method according to claim 1 or 2 or 3, characterized in that, if the first index value that is the same as the index value corresponding to the first query unit is not found in the index table, the method further includes: determining whether the size of the data before the starting position of the sliding window reaches the size of the first query unit, where "before" is relative to the direction opposite to the sliding direction of the sliding window; and, if not, sliding the sliding window by a preset step size, using the data in the sliding window after sliding as the first query unit, and performing the index construction step and the duplicate data search step.

6. The method according to claim 5, characterized in that, if it is determined that the size of the data before the starting position of the sliding window reaches the size of the first query unit, the data before the starting position of the sliding window is stored as new data, and the correspondence between the index value corresponding to the new data and the storage address of the new data is inserted into the index table.

7. The method according to claim 5, characterized in that, if the first index value that is the same as the index value corresponding to the first query unit is not found in the index table, then before determining whether the size of the data before the starting position of the sliding window reaches the size of the first query unit, the method further includes: querying the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than a preset matching degree; if the index table contains no such third index value, permuting the position order of the values in the index value corresponding to the first query unit, and determining whether the third index value is found in the index table for the permuted index value corresponding to the first query unit; and, if the third index value is not found, entering the step of determining whether the size of the data before the starting position of the sliding window reaches the size of the first query unit.

8. The method according to claim 7, characterized in that permuting the position order of the data in the first query unit, and determining whether the third index value is found in the index table for the permuted index value corresponding to the first query unit, includes: dividing the index value corresponding to the first query unit into multiple parts, and permuting, for the first time, the position order of the multiple parts within the index value corresponding to the first query unit; determining whether a third index value with a matching degree equal to or higher than the preset matching degree is found in the index table for the permuted index value corresponding to the first query unit; if not found, permuting, for the second time, the position order of the multiple parts within the index value corresponding to the first query unit; continuing to determine whether the same third index value is found in the index table for the permuted index value corresponding to the first query unit; and, when all permutations have been completed and no third index value reaching the preset matching degree has been found in the index table, stopping the permutation and entering the step of determining whether the size of the data before the starting position of the sliding window reaches the size of the first query unit.

9. The method according to claim 7, characterized in that, if a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree is found in the index table, the data in the first query unit is searched for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.

10. The method according to claim 9, further comprising: obtaining a duplicate data search result, and, if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, performing delta compression on the data in the first query unit relative to the target data pointed to by the data storage address corresponding to the first index value, storing the data obtained after delta compression, and inserting into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.

11. The method according to any one of claims 1 to 3, characterized in that extracting some bits from the fingerprint value of each minimum data block in the first query unit, and using the extracted bits to form an index value of a preset length corresponding to the first query unit, includes: obtaining the fingerprint value of each minimum data block in the first query unit, extracting a preset, equal number of bits from the fingerprint value corresponding to each minimum data block, and combining all the extracted bits into the index value of the preset length corresponding to the first query unit.

12. The method according to any one of claims 1 to 3, characterized in that the index table is stored in the deduplication processor, and querying the preset index table for an index value that is the same as the index value corresponding to the first query unit includes: querying the preset index table in the deduplication processor for an index value that is the same as the index value corresponding to the first query unit; or the data processing system further includes a storage node, the preset index table is stored in the storage node, and querying the preset index table for an index value that is the same as the index value corresponding to the first query unit includes: the deduplication processor sends the index value corresponding to the first query unit to the storage node and receives the query result from the storage node, thereby obtaining information on whether the preset index table contains an index value that is the same as the index value corresponding to the first query unit.

13. A data processing device, characterized in that it includes: an index construction unit, configured to perform index construction, where the index construction includes: using, as a first query unit, the data covered by a sliding window among the data on which duplicate data search is to be performed, extracting some bits from the fingerprint value of each minimum data block in the first query unit, and using the extracted bits to form an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is a data block of the minimum query unit for duplicate data search; an index matching unit, configured to query a preset index table for an index value that is the same as the index value corresponding to the first query unit; and a duplicate data search unit, configured to, if the index matching unit finds in the index table a first index value that is the same as the index value corresponding to the first query unit, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.

14. The data processing device according to claim 13, characterized in that, if the duplicate data search unit finds that the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value, then the index construction unit is further configured to use the data before the starting position of the sliding window as a second query unit, where "before" is relative to the direction opposite to the sliding direction of the sliding window, and to construct an index value of the preset length from at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block; the index matching unit is further configured to query the index table for a second index value that is the same as the index value corresponding to the second query unit; and the duplicate data search unit is further configured to, if the index matching unit finds in the index table the second index value that is the same as the index value corresponding to the second query unit, search the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

15. The data processing device according to claim 14, characterized in that, if the index matching unit does not find in the index table the second index value that is the same as the index value corresponding to the second query unit, the device further includes: a first storage unit, configured to store the data in the second query unit; and a first index table update unit, configured to insert into the preset index table the correspondence between the index value corresponding to the data in the second query unit and the storage address of the data in the second query unit.

16. The data processing device according to claim 13, characterized in that, if the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the device further includes:

a first judgment unit, configured to determine, when the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, whether the size of the data before the starting position of the sliding window reaches the size of the first query unit, where "before" is relative to the direction opposite to the sliding direction of the sliding window; and a first instruction unit, configured to slide the sliding window by a preset step size when the first judgment unit determines that the size of the data before the starting position of the sliding window does not reach the size of the first query unit; the index construction unit is further configured to use the data in the sliding window after sliding as a first query unit to construct an index.

17. The data processing device according to claim 13 or 14 or 15, characterized in that, if the index matching unit does not find in the index table the first index value that is the same as the index value corresponding to the first query unit, the device further includes: a second judgment unit, configured to determine, when the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, whether the size of the data before the starting position of the sliding window reaches the size of the first query unit, where "before" is relative to the direction opposite to the sliding direction of the sliding window; and a second instruction unit, configured to slide the sliding window by a preset step size when the second judgment unit determines that the size of the data before the starting position of the sliding window does not reach the size of the first query unit; the index construction unit is further configured to use the data in the sliding window after sliding as a first query unit to construct an index.

18. The data processing device according to claim 17, further comprising: a second storage unit, configured to store the data before the starting position of the sliding window as new data if the second judgment unit determines that the size of the data before the starting position of the sliding window reaches the size of the first query unit; and a second index table update unit, configured to insert into the index table the correspondence between the index value corresponding to the new data and the storage address of the new data.

19. The data processing device according to claim 17, characterized in that, if the index matching unit does not find in the index table the first index value that is the same as the index value corresponding to the first query unit, the index matching unit is further configured to: query the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than a preset matching degree; and, if the index table contains no such third index value, permute the position order of the values in the index value corresponding to the first query unit and determine whether the same third index value is found in the index table for the permuted index value corresponding to the first query unit; the second judgment unit is further configured to determine, when the index matching unit ultimately fails to match the third index value, whether the size of the data before the starting position of the sliding window reaches the size of the first query unit, where "before" is relative to the direction opposite to the sliding direction of the sliding window, and to send the result to the second instruction unit.

20. The data processing device according to claim 19, characterized in that, if the index matching unit does not find in the index table the first index value that is the same as the index value corresponding to the first query unit, the index matching unit is specifically configured to: divide the index value corresponding to the first query unit into multiple parts, and permute, for the first time, the position order of the multiple parts within the index value corresponding to the first query unit; determine whether a third index value with a matching degree equal to or higher than the preset matching degree is found in the index table for the permuted index value corresponding to the first query unit; if not found, permute, for the second time, the position order of the multiple parts within the index value corresponding to the first query unit; continue to determine whether the same third index value is found in the index table for the permuted index value corresponding to the first query unit; and, when all permutations have been completed and no third index value with a matching degree equal to or higher than the preset matching degree has been found in the index table, stop the permutation and send the judgment result to the second judgment unit.

21. The data processing device according to claim 19, characterized in that the duplicate data search unit is further configured to, if the index matching unit finds in the index table a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.

22. The device according to claim 21, further comprising:

a delta compression unit, configured to obtain the duplicate data search result of the duplicate data search unit and, if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, perform delta compression on the data in the first query unit relative to the target data pointed to by the data storage address corresponding to the first index value; the second storage unit is further configured to store the data obtained after delta compression; the second index table update unit is further configured to insert into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.

23. The data processing device according to any one of claims 13 to 15, characterized in that the index construction unit is specifically configured to obtain the fingerprint value of each minimum data block in the first query unit, extract a preset, equal number of bits from the fingerprint value corresponding to each minimum data block, and combine all the extracted bits into the index value of the preset length corresponding to the first query unit.

24. A deduplication processor, characterized in that it includes a processor, a memory, a communication interface, and a bus;

The processor, the communication interface, and the memory communicate with each other through the bus; the communication interface is used to receive and send data;

The memory is used to store programs; the processor is used to execute the program in the memory, and perform the method as described in any one of claims 1-12.
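The permutation-based fallback recited in claims 7, 8, 19, and 20 can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the 32-bit index width, the split into four equal parts, the helper names `split_parts`, `join_parts`, and `find_permuted_match`, and the dictionary-based `index_table` are all hypothetical, and the "preset matching degree" is simplified here to exact equality between a permuted index value and a table entry.

```python
from itertools import permutations


def split_parts(index_value: int, total_bits: int, parts: int) -> list:
    """Divide an index value into equal-width parts, most significant first."""
    width = total_bits // parts
    mask = (1 << width) - 1
    return [(index_value >> (width * (parts - 1 - i))) & mask for i in range(parts)]


def join_parts(pieces, width: int) -> int:
    """Reassemble an index value from parts given in a new position order."""
    value = 0
    for piece in pieces:
        value = (value << width) | piece
    return value


def find_permuted_match(index_value: int, index_table: dict,
                        total_bits: int = 32, parts: int = 4):
    """Try each reordering of the parts until one is present in the table.

    Returns the matching table key, or None once every permutation has been
    tried without success (the point at which the method falls back to
    checking the size of the data before the sliding window).
    """
    width = total_bits // parts
    pieces = split_parts(index_value, total_bits, parts)
    for order in permutations(pieces):
        candidate = join_parts(order, width)
        if candidate == index_value:
            continue  # the unpermuted value was already tried by the exact lookup
        if candidate in index_table:
            return candidate
    return None
```

For example, with the assumed four 8-bit parts, the index value `0x01020304` would match a table entry keyed `0x04030201`, which corresponds to the same minimum data blocks appearing in a different order.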