CN113495901B - Quick retrieval method for variable-length data blocks

Info

Publication number: CN113495901B
Application number: CN202110424974.2A
Authority: CN
Inventors: 徐振楠; 吕鑫; 吴涛; 高晟凯
Original assignee: Hohai University HHU; Huaneng Lancang River Hydropower Co Ltd; NetsUnion Clearing Corp
Current assignee: Hohai University HHU; Huaneng Lancang River Hydropower Co Ltd; NetsUnion Clearing Corp
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2023-10-13
Anticipated expiration: 2041-04-20
Also published as: CN113495901A

Abstract

The application discloses a quick retrieval method for variable-length data blocks, comprising: extracting the length of each data block and the byte values at selected positions; constructing an index tree; and calculating and comparing data block fingerprints. By building the index from the length and sampled bytes of each variable-length data block, the method checks for index collisions first and calculates fingerprints only afterwards, which reduces the number of fingerprint calculations and improves retrieval efficiency.

Description

Quick retrieval method for variable-length data blocks

Technical Field

The application relates to the technical field of data storage, in particular to a quick retrieval method for variable-length data blocks.

Background

An important indicator of data deduplication capability is overhead, which mainly comprises the overhead of fingerprint calculation and the overhead of fingerprint retrieval. As a storage system ages, the volume of stored data keeps growing; fingerprint retrieval and comparison then consume a large share of computing resources and increase disk I/O, which reduces retrieval efficiency. Optimizing retrieval efficiency is therefore a main means of reducing system overhead.

Retrieval efficiency is mainly optimized by the following means: fast membership tests using a Bloom filter; preloading data by exploiting data locality; constructing hierarchical indexes by exploiting data similarity; and optimizing the index structure for the storage medium or for the type of data stored.

At present, duplicate data is mainly retrieved as follows: the fingerprint of the data block is calculated first, and the fingerprint is then looked up in the fingerprint index table to determine whether the data block is a duplicate. However, data block fingerprints are generally produced by secure hash functions such as SHA-256 and SHA-3, so calculating the fingerprint first consumes considerable computation time.
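
For illustration only, the conventional fingerprint-first retrieval described above might look like the following Python sketch. The function name dedup_fingerprint_first and the use of an in-memory set as the fingerprint index table are assumptions and do not come from the original text; SHA-256 is used because it is one of the hash functions named above.

```python
import hashlib

def dedup_fingerprint_first(blocks):
    """Baseline: hash every block, then look the digest up in a table."""
    seen = set()      # stands in for the fingerprint index table (assumption)
    flags = []
    for block in blocks:
        fp = hashlib.sha256(block).digest()  # computed for every block
        flags.append(fp in seen)             # True means the block is a duplicate
        seen.add(fp)
    return flags

flags = dedup_fingerprint_first([b"alpha", b"beta", b"alpha"])
```

In this baseline a secure hash is computed for every input block, including blocks that turn out to have no possible match.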

Disclosure of Invention

The application aims to provide a quick retrieval method for variable-length data blocks that constructs an index from the length and selected bytes of each variable-length data block, so that index collisions are checked first and fingerprints are calculated only afterwards, thereby reducing the number of fingerprint calculations and improving retrieval efficiency.

The technical scheme adopted by the application is as follows:

A quick retrieval method for variable-length data blocks comprises the following steps:

Step S1, reading a variable-length data block group to be retrieved;

Step S2.1, inputting a data block of the variable-length data block group;

Step S2.2, if the current input is empty, outputting the determination result and stopping the retrieval; otherwise extracting the length L of the current data block and the byte values A0, A1, A2, … at partial positions;

Step S3, mapping the length of the current data block to the interval [0,255], and constructing an index tree with the mapped value as the first-level child node of the index tree;

Step S4, calculating the fingerprint of the data block, comparing fingerprints, adding the information of the data block as a child node of the current node, and returning to step S1.

Further, the step S3 specifically includes:

Step S3.1, calculating K = min(L, 65536) mod 256, where K represents the value obtained by mapping the current data block length to the interval [0,255], L represents the data block length, min(L, 65536) takes the smaller of L and 65536 so that all data blocks whose length exceeds 64 KB fall under the same child node, and mod represents the modulo operation;

Step S3.2, taking K, A0, A1, A2, … in sequence as keys, constructing an index tree S-K-A0-A1-A2-…, where S represents the root node of the index tree.
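
As an illustrative sketch only, steps S3.1 and S3.2 could be expressed in Python as follows. The function names length_key and insert_path, and the use of nested dictionaries to represent the index tree, are assumptions made for illustration and are not specified in the original text.

```python
def length_key(length: int) -> int:
    """Map a block length to [0, 255] using K = min(L, 65536) mod 256."""
    return min(length, 65536) % 256

def insert_path(root: dict, keys) -> dict:
    """Walk or create the path S-K-A0-A1-... and return the final node."""
    node = root
    for key in keys:
        node = node.setdefault(key, {})  # create the child only if missing
    return node

root = {}  # S, the root node of the index tree
leaf = insert_path(root, [length_key(300), 0x1D, 0x34, 0x50])
```

Under this formula every length of 65536 or more maps to K = 0, which is also the value produced by any shorter length that is a multiple of 256.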

Further, the step S4 specifically includes:

Step S4.1, if no other data block exists under the current index, the currently detected data block is a unique block, and the method returns to step S2; otherwise, proceeding to step S4.2;

Step S4.2, calculating the fingerprints of those data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;

Step S4.3, comparing the fingerprint of the currently detected data block with the fingerprints of the other data blocks under the index; if an identical fingerprint exists, the currently detected data block is a duplicate block, otherwise it is a non-duplicate block;

Step S4.4, adding the data block information as a child node of the current node, and returning to step S2.
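
The deferred fingerprint calculation of steps S4.1 to S4.4 might be sketched in Python as below. This is a hedged illustration, not the patented implementation: the function name check_bucket, the list-of-pairs layout for the blocks stored under one index, and the choice of SHA-256 are all assumptions. The sketch also assumes that a unique block is recorded under the index in step S4.1 so that later blocks can collide with it.

```python
import hashlib

def check_bucket(bucket: list, block: bytes) -> str:
    """bucket holds [data, fingerprint_or_None] entries for one index node."""
    if not bucket:                      # S4.1: nothing else under this index
        bucket.append([block, None])    # record the block, defer its fingerprint
        return "unique"
    for entry in bucket:                # S4.2: fill in fingerprints still missing
        if entry[1] is None:
            entry[1] = hashlib.sha256(entry[0]).digest()
    fp = hashlib.sha256(block).digest()
    duplicate = any(entry[1] == fp for entry in bucket)   # S4.3
    bucket.append([block, fp])          # S4.4: add the block under the node
    return "duplicate" if duplicate else "non-duplicate"

bucket = []
first = check_bucket(bucket, b"abc")    # no hash is computed for this call
second = check_bucket(bucket, b"abc")
third = check_bucket(bucket, b"abd")
```

A real system would more likely store a reference to the block's location than the block bytes themselves; the bytes are kept here only to make the sketch self-contained.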

Further, the data block fingerprint is a secure hash function value of the data block.

Further, the extracted byte values at partial positions are defined as the 1st byte value, the last byte value, the [L/2]-th byte value and the 2ⁿ-th byte value, where L represents the data block length, n is a natural number, and 2ⁿ ≤ L.
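
One possible reading of this sampling rule is sketched below in Python. Several details are assumptions, since the text does not fix them: positions are treated as 1-based, [L/2] is taken as the floor of L/2 (clamped to 1 for a one-byte block), n starts at 1, and duplicate positions are not removed. The function names sample_positions and sample_bytes are hypothetical.

```python
def sample_positions(length: int) -> list:
    """1-based positions: first, last, [L/2], then each 2**n <= length (n >= 1)."""
    if length <= 0:
        return []
    positions = [1, length, max(1, length // 2)]
    power = 2
    while power <= length:
        positions.append(power)
        power *= 2
    return positions

def sample_bytes(block: bytes) -> list:
    """Return the byte values found at the sampled positions."""
    return [block[pos - 1] for pos in sample_positions(len(block))]

positions = sample_positions(10)
values = sample_bytes(bytes(range(10)))
```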

Compared with the prior art, the application has the following beneficial effects:

(1) The application introduces the block size as one of the index keys, and is therefore applicable to content-based chunking methods;

(2) The application constructs the index from the block content and calculates and compares fingerprints only when more than one block exists under the same index, which reduces the number of fingerprint calculations and improves retrieval efficiency.

Drawings

FIG. 1 is a flow chart of the method of the present application;

FIG. 2 is a schematic diagram showing the specific steps of Example 2.

Detailed Description

The application is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.

Example 1

As shown in FIG. 1, a quick retrieval method for variable-length data blocks proceeds as follows:

(1) Inputting a group of blocks to be checked for duplicates;

(2) Inputting a block. If the current input is empty, terminating the retrieval; otherwise going to step (3);

(3) Extracting the length L of the block, the first byte A, the last byte B and the [L/2]-th byte C;

(4) Calculating K = min(L, 65536) mod 256;

(5) Taking K, A, B and C in sequence as nodes to construct an index tree S-K-A-B-C, where S represents the root node of the index tree;

(6) If no other block exists under the current index, the currently detected block is a unique block, and the method returns to step (2); otherwise, proceeding to the next step;

(7) Calculating the fingerprints of those data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;

(8) Comparing the fingerprint of the currently detected data block with the fingerprints of the other blocks under the index; if an identical fingerprint exists, the currently detected block is a duplicate block, otherwise it is a non-duplicate block; then returning to step (2).
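
The flow of Example 1 can be approximated end to end by the following Python sketch. It is a minimal illustration under stated assumptions rather than the reference implementation: the function name detect_duplicates, the nested-dictionary tree, the "blocks" key used to hold the entries under a node, 1-based byte positions, and SHA-256 as the fingerprint are all choices made here. Blocks are assumed to be non-empty.

```python
import hashlib

def detect_duplicates(blocks):
    """Return (flags, hash_count); flags[i] is True when blocks[i] is a duplicate."""
    root = {}                      # S, the root node of the index tree
    hash_count = 0                 # number of fingerprint calculations performed
    flags = []
    for block in blocks:           # assumes every block is non-empty
        L = len(block)
        A, B, C = block[0], block[-1], block[max(1, L // 2) - 1]   # step (3)
        K = min(L, 65536) % 256                                    # step (4)
        node = root
        for key in (K, A, B, C):   # step (5): walk or create S-K-A-B-C
            node = node.setdefault(key, {})
        entries = node.setdefault("blocks", [])
        if not entries:            # step (6): only block under this index
            entries.append([block, None])
            flags.append(False)
            continue
        for entry in entries:      # step (7): hash entries not yet fingerprinted
            if entry[1] is None:
                entry[1] = hashlib.sha256(entry[0]).digest()
                hash_count += 1
        fp = hashlib.sha256(block).digest()
        hash_count += 1
        flags.append(any(e[1] == fp for e in entries))             # step (8)
        entries.append([block, fp])
    return flags, hash_count

flags, hash_count = detect_duplicates([b"hello world", b"hello world", b"goodbye"])
```

In this run only two fingerprints are calculated for three blocks, because the third block lands under an index with no other block and is never hashed.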

Example 2

As shown in FIG. 2, a quick retrieval method for variable-length data blocks proceeds as follows:

(1) Reading a variable-length data block group to be retrieved;

(2) Inputting data block 1 of the variable-length data block group;

(3) Extracting the length L of data block 1 and the byte values 1D, 34 and 50 at partial positions;

(4) Mapping the length of data block 1 to the interval [0,255] by calculating K = min(L, 65536) mod 256, where K represents the value obtained by mapping the current data block length to the interval [0,255], L represents the data block length, min(L, 65536) takes the smaller of L and 65536 so that all data blocks whose length exceeds 64 KB fall under the same child node, and mod represents the modulo operation;

(5) Taking K, 1D, 34 and 50 in sequence as keys to construct an index tree S-K-1D-34-50, where S represents the root node of the index tree;

(6) No other data block exists under the current index, so data block 1 is a unique block;

(7) Inputting data block 2 of the variable-length data block group;

(8) Extracting the length L of data block 2 and the byte values 1D, 34 and 50 at partial positions;

(9) Mapping the length of data block 2 to the interval [0,255] by calculating K = min(L, 65536) mod 256;

(10) Taking K, 1D, 34 and 50 in sequence as keys to construct an index tree S-K-1D-34-50;

(11) Other data blocks exist under the current index, so data block 2 is not a unique block and fingerprints need to be compared;

(12) Calculating the fingerprints of those data blocks under the index of data block 2 whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;

(13) Comparing the fingerprint of data block 2 with the fingerprints of the other data blocks under the index; if an identical fingerprint exists, the currently detected data block is a duplicate block, otherwise it is a non-duplicate block;

(14) Adding the information of data block 2 as a child node of the current node;

(15) Inputting data block 3 of the variable-length data block group;

(16) Extracting the length L of data block 3 and the byte values 82, 9D and 12 at partial positions;

(17) Mapping the length of data block 3 to the interval [0,255] by calculating K = min(L, 65536) mod 256;

(18) Taking K, 82, 9D and 12 in sequence as keys to construct an index tree S-K-82-9D-12;

(19) No other data block exists under the current index, so data block 3 is a unique block.
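
To make the index collision in Example 2 concrete, the short Python sketch below builds index keys for three hypothetical blocks. The block contents, the five-byte length, the function name index_key, and the assumption that the three sampled bytes are the first, last and [L/2]-th bytes (1-based) are all invented for illustration; the example itself gives only the sampled values 1D, 34, 50 and 82, 9D, 12.

```python
def index_key(block: bytes) -> tuple:
    """Return (K, first byte, last byte, [L/2]-th byte) for a non-empty block."""
    L = len(block)
    K = min(L, 65536) % 256
    return (K, block[0], block[-1], block[max(1, L // 2) - 1])

# Hypothetical contents chosen so the sampled bytes match the example.
block1 = bytes.fromhex("1d50aabb34")
block2 = bytes.fromhex("1d50ccdd34")   # same sampled bytes, different content
block3 = bytes.fromhex("821200009d")

key1, key2, key3 = index_key(block1), index_key(block2), index_key(block3)
```

Here blocks 1 and 2 share one index path, so only their arrival together triggers fingerprint calculation, while block 3 occupies its own path and needs none.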

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is merely a preferred embodiment of the present application, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and variations should also be regarded as being within the scope of the application.

Claims

1. A quick retrieval method for variable-length data blocks, characterized by comprising the following steps:

step S1, reading a variable-length data block group to be retrieved;

step S2, extracting the length of the data blocks in the data block group and the byte values at partial positions;

the extracted byte values at partial positions are defined as the 1st byte value, the last byte value, the [L/2]-th byte value and the 2ⁿ-th byte value, where L represents the data block length, n is a natural number, and 2ⁿ ≤ L;

the step S2 specifically includes:

step S2.1, inputting a data block of the variable-length data block group;

the step S3 specifically includes:

step S3.2, taking K, A0, A1, A2, … in sequence as keys to construct an index tree S-K-A0-A1-A2-…, where S represents the root node of the index tree;

step S4, calculating the data block fingerprint, comparing fingerprints, adding the data block information as a child node of the current node, and returning to step S1;

the step S4 specifically includes:

2. The quick retrieval method for variable-length data blocks according to claim 1, wherein the data block fingerprint is a secure hash function value of the data block.