CN113495901A

CN113495901A - Variable-length data block oriented quick retrieval method

Info

Publication number: CN113495901A
Application number: CN202110424974.2A
Authority: CN
Inventors: 徐振楠; 吕鑫; 吴涛; 高晟凯
Original assignee: Hohai University HHU; Huaneng Lancang River Hydropower Co Ltd; NetsUnion Clearing Corp
Current assignee: Hohai University HHU; Huaneng Lancang River Hydropower Co Ltd; NetsUnion Clearing Corp
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2021-10-12
Anticipated expiration: 2041-04-20
Also published as: CN113495901B

Abstract

The invention discloses a fast retrieval method for variable-length data blocks, comprising: extracting the length of a data block and the byte values at selected positions; constructing an index tree; and calculating and comparing data block fingerprints. By building the index from the length and selected bytes of each variable-length data block, the method first checks for an index conflict and calculates a fingerprint only afterwards, which reduces the number of fingerprint calculations and improves retrieval efficiency.

Description

Variable-length data block oriented quick retrieval method

Technical Field

The invention relates to the technical field of data storage, in particular to a quick retrieval method for variable-length data blocks.

Background

An important metric for deduplication capability is system overhead, which mainly comprises the overhead of fingerprint calculation and the overhead of fingerprint retrieval. In a storage system, the volume of stored data grows over time; fingerprint retrieval and comparison then not only consume a large amount of computing resources but also increase disk I/O, which reduces retrieval efficiency. Optimizing retrieval efficiency is therefore a major means of reducing system overhead.

Retrieval efficiency is mainly optimized by means such as fast membership checks with Bloom filters, preloading based on data locality, building hierarchical indexes from data similarity, tailoring the index structure to the storage medium, and tailoring the index structure to the type of stored data.

At present, duplicate data is mainly retrieved as follows: the fingerprint of the data block is calculated first, and the fingerprint is then looked up in the fingerprint index table to determine whether the data block is a duplicate. However, data block fingerprints generally use secure hash functions such as SHA256 and SHA-3, so calculating the fingerprint up front already takes considerable computation time.
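For comparison with the method disclosed below, the conventional fingerprint-first flow can be sketched in Python as follows. The function name `fingerprint_first_dedup` and the in-memory set standing in for the fingerprint index table are assumptions made for illustration only.

```python
import hashlib


def fingerprint_first_dedup(blocks):
    """Conventional flow: hash every block, then look it up in a fingerprint table."""
    seen = set()          # hypothetical in-memory stand-in for the fingerprint index table
    results = []          # True where a block repeats an earlier one
    hashes_computed = 0
    for block in blocks:
        # The secure hash is computed for every block, duplicate or not.
        fp = hashlib.sha256(block).digest()
        hashes_computed += 1
        results.append(fp in seen)
        seen.add(fp)
    return results, hashes_computed


results, hashes_computed = fingerprint_first_dedup([b"alpha", b"beta", b"alpha"])
```

In this sketch the number of hash computations always equals the number of input blocks, which is the cost the invention aims to reduce.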

Disclosure of Invention

The invention aims to provide a fast retrieval method for variable-length data blocks. By extracting the length and selected bytes of each variable-length data block to construct an index, the method checks for index conflicts first and calculates fingerprints afterwards, thereby reducing the number of fingerprint calculations and improving retrieval efficiency.

The technical scheme adopted by the invention is as follows:

A fast retrieval method for variable-length data blocks comprises the following steps:

step S1, reading a variable-length data block group to be retrieved;

step S2, extracting the length and partial-position byte values of a data block in the data block group, which specifically includes:

step S2.1, inputting a data block of the variable-length data block group;

step S2.2, if the current input is null, outputting a judgment result and terminating the retrieval; otherwise, extracting the length L of the current data block and the partial-position byte values A0, A1, A2, …;

step S3, mapping the length of the current data block to the interval [0,255], and constructing an index tree by taking the mapped value as the first child node of the index tree;

and step S4, calculating the fingerprint of the data block, comparing the fingerprint, adding the data block information as a child node of the current node, and returning to step S2.

Further, in step S3, specifically, the step includes:

step S3.1, calculating K = min(L, 65536) mod 256, where K denotes the value obtained by mapping the current data block length to the [0,255] interval, L denotes the data block length, min(L, 65536) denotes the smaller of L and 65536 and is used so that data blocks longer than 64 KB fall under the same child node, and mod denotes the modulo operation;

and S3.2, sequentially using K, A0, A1, A2 and … as keys to construct an index tree S-K-A0-A1-A2- …, wherein S represents a root node of the index tree.
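As an illustration of steps S3.1 and S3.2, the following Python sketch computes K and walks a nested-dictionary tree along the path S-K-A0-A1-A2-…. The names `map_length`, `index_path` and `descend`, the nested-dict representation of the tree, and the 0-based offsets passed in by the caller are assumptions for illustration, not part of the patent text.

```python
def map_length(length: int) -> int:
    """Step S3.1: K = min(L, 65536) mod 256, always in [0, 255]."""
    return min(length, 65536) % 256


def index_path(block: bytes, offsets):
    """Key sequence (K, A0, A1, ...) for one block.

    `offsets` are 0-based byte offsets chosen by the caller (hypothetical
    parameter; the patent text only speaks of "partial positions").
    """
    return (map_length(len(block)),) + tuple(block[o] for o in offsets)


def descend(root: dict, keys):
    """Step S3.2: follow or create the path S-K-A0-A1-... and return the last node."""
    node = root
    for key in keys:
        node = node.setdefault(key, {})
    return node


root = {}                                      # S, the root node
path = index_path(b"\x1d\x34\x50", [0, 1, 2])  # 3-byte block, all three bytes sampled
leaf = descend(root, path)
```

Under the formula as written, every length of 65536 or more maps to K = 0, the same value as any shorter length that is a multiple of 256; such blocks are then separated only by the sampled bytes and, if necessary, by their fingerprints.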

Further, in step S4, specifically, the step includes:

step S4.1, if no other data block exists under the current index, the currently detected data block is a unique block, and the method returns to step S2; otherwise, go to step S4.2;

step S4.2, calculating the fingerprints of any data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;

step S4.3, comparing the fingerprint of the currently detected data block with the fingerprints of the other data blocks under the index; if a matching fingerprint exists, the currently detected data block is determined to be a duplicate block, and otherwise it is determined to be a non-duplicate block;

and S4.4, adding the data block information as a child node of the current node, and returning to the step S2.
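Steps S4.1 to S4.4 can be sketched for a single index node as follows. This is a minimal sketch under stated assumptions: the `Entry` class, the `check_in_bucket` function, the use of a Python list as the set of child entries, and SHA-256 as the secure hash are all illustrative choices. The sketch also assumes that the first block reaching a node is recorded there without a fingerprint, so that later blocks can find it; the text does not spell out where that recording happens.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    data: bytes                          # assumption: the block is reachable from the index
    fingerprint: Optional[bytes] = None  # filled in lazily, only when a conflict occurs


def check_in_bucket(bucket: List[Entry], block: bytes) -> bool:
    """Return True if `block` duplicates an entry already under this index node."""
    if not bucket:
        # S4.1: no other block under this index, so the block is unique.
        # Assumption: it is recorded here without computing any fingerprint.
        bucket.append(Entry(block))
        return False
    # S4.2: compute fingerprints that are still missing, plus the new block's.
    for entry in bucket:
        if entry.fingerprint is None:
            entry.fingerprint = hashlib.sha256(entry.data).digest()
    fp = hashlib.sha256(block).digest()
    # S4.3: compare against the other blocks under the index.
    duplicate = any(entry.fingerprint == fp for entry in bucket)
    # S4.4: add the block information under the current node.
    bucket.append(Entry(block, fp))
    return duplicate


bucket: List[Entry] = []
first = check_in_bucket(bucket, b"abc")        # unique, no hash computed
lazy_before = bucket[0].fingerprint            # still None at this point
second = check_in_bucket(bucket, b"abc")       # conflict, fingerprints computed and equal
third = check_in_bucket(bucket, b"abd")        # conflict, fingerprints differ
```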

Further, the data block fingerprint is a secure hash function value of the data block.

Further, the extracted partial-position byte values are defined as the 1st byte value, the last byte value, the [L/2]-th byte value, and the 2ⁿ-th byte value, where L represents the data block length, n is a natural number, and 2ⁿ ≤ L.
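One possible reading of this position rule is sketched below in Python. The function names `sample_positions` and `sample_bytes` are hypothetical. The sketch assumes 1-based positions, takes [L/2] as integer division clamped to at least 1, and starts n at 1 (since 2⁰ = 1 would repeat the first byte); none of these details is fixed by the text.

```python
def sample_positions(length: int) -> list:
    """1-based positions: first, last, [L/2]-th, then every 2^n-th with 2^n <= L."""
    if length < 1:
        raise ValueError("block must contain at least one byte")
    positions = [1, length, max(1, length // 2)]
    p = 2                      # assumption: n starts at 1
    while p <= length:
        positions.append(p)
        p *= 2
    return positions


def sample_bytes(block: bytes) -> list:
    """Byte values at the sampled positions (converted to 0-based indexing)."""
    return [block[p - 1] for p in sample_positions(len(block))]


positions = sample_positions(10)          # [1, 10, 5, 2, 4, 8]
values = sample_bytes(bytes(range(10)))   # [0, 9, 4, 1, 3, 7]
```

Because the positions depend only on L, two identical blocks always yield the same key sequence, which is what the index lookup relies on.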

Compared with the prior art, the invention has the following beneficial effects:

(1) the invention introduces the block size as one of indexes, and can be applied to a content-based blocking method;

(2) the method constructs the index from the block content, and fingerprints need to be calculated and compared only when multiple blocks exist under the same index, which reduces the number of fingerprint calculations and improves retrieval efficiency.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram showing the detailed steps of example 2.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

Example 1

As shown in fig. 1, a fast retrieval method for variable-length data blocks:

(1) inputting a group of blocks needing repeated detection;

(2) inputting a block; if the current input is null, terminating the retrieval; otherwise, going to step (3);

(3) Extracting the length L, the first byte A, the last byte B and the [ L/2] th byte C of the block;

(4) calculating K = min(L, 65536) mod 256;

(5) sequentially taking K, A, B and C as nodes to construct an index tree S-K-A-B-C, wherein S represents a root node of the index tree;

(6) if no other block exists under the current index, the currently detected block is a unique block, and the method returns to step (2); otherwise, entering the next step;

(7) calculating the fingerprints of the data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;

(8) comparing the fingerprint of the currently detected data block with the fingerprints of the other blocks under the index; if a matching fingerprint exists, the currently detected block is determined to be a duplicate block, and otherwise a non-duplicate block; then returning to step (2).
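The flow of Example 1 can be sketched end to end as follows. This is a sketch under stated assumptions rather than the patented implementation: the class `LengthByteIndex`, the flat dictionary keyed by the tuple (K, A, B, C) in place of an explicit tree S-K-A-B-C, SHA-256 as the fingerprint, and the `hash_calls` counter are all illustrative, and blocks are assumed to be non-empty.

```python
import hashlib


class LengthByteIndex:
    def __init__(self):
        self.buckets = {}     # (K, A, B, C) -> list of [block, fingerprint or None]
        self.hash_calls = 0   # counts fingerprint computations, for illustration

    def _fingerprint(self, block: bytes) -> bytes:
        self.hash_calls += 1
        return hashlib.sha256(block).digest()

    def is_duplicate(self, block: bytes) -> bool:
        length = len(block)                       # step (3): length L
        key = (
            min(length, 65536) % 256,             # step (4): K
            block[0],                             # A: first byte
            block[-1],                            # B: last byte
            block[max(length // 2, 1) - 1],       # C: [L/2]-th byte, 1-based, clamped
        )
        bucket = self.buckets.setdefault(key, []) # step (5): locate or create the node
        if not bucket:                            # step (6): unique, no hashing needed
            bucket.append([block, None])
            return False
        for entry in bucket:                      # step (7): fill in missing fingerprints
            if entry[1] is None:
                entry[1] = self._fingerprint(entry[0])
        fp = self._fingerprint(block)
        duplicate = any(entry[1] == fp for entry in bucket)   # step (8)
        bucket.append([block, fp])
        return duplicate


index = LengthByteIndex()
blocks = [b"hello world", b"hello world", b"HELLO WORLD", b"hello there"]
results = [index.is_duplicate(b) for b in blocks]
```

In this run only the second block shares an index node with an earlier block, so two fingerprints are computed in total (one for the stored block, one for the incoming block) instead of four.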

Example 2

As shown in fig. 2, a fast retrieval method for variable-length data blocks:

(1) reading a variable-length data block group to be retrieved;

(2) inputting a data block 1 of one variable-length data block group;

(3) extracting the length L of data block 1 and its partial-position byte values 1D, 34 and 50;

(4) mapping the length of data block 1 to the [0,255] interval by calculating K = min(L, 65536) mod 256, where K represents the value obtained by mapping the current data block length to the [0,255] interval, L represents the data block length, min(L, 65536) represents the smaller of L and 65536 and is used so that data blocks longer than 64 KB fall under the same child node, and mod represents the modulo operation;

(5) sequentially taking K, 1D,34 and 50 as keys to construct an index tree S-K-1D-34-50, wherein S represents a root node of the index tree;

(6) no other data block exists under the current index, and the data block 1 is a unique block;

(7) inputting a data block 2 of one of the variable-length data block groups;

(8) extracting the length L of data block 2 and its partial-position byte values 1D, 34 and 50;

(9) mapping the length of data block 2 to the [0,255] interval by calculating K = min(L, 65536) mod 256;

(10) sequentially taking K, 1D,34 and 50 as keys to construct an index tree S-K-1D-34-50;

(11) other data blocks exist under the current index, so data block 2 is not a unique block and its fingerprint needs to be compared;

(12) calculating the fingerprints of the data blocks under the index of data block 2 whose fingerprints have not yet been calculated, and the fingerprint of data block 2;

(13) comparing the fingerprint of the data block 2 with the fingerprints of other data blocks under the index, if the fingerprint exists, determining the currently detected data block as a repeated block, and otherwise, determining the currently detected data block as a non-repeated block;

(14) adding the data block 2 information as a child node of the current node;

(15) inputting a data block 3 of one of the variable-length data block groups;

(16) extracting the length L of data block 3 and its partial-position byte values 82, 9D and 12;

(17) mapping the length of data block 3 to the [0,255] interval by calculating K = min(L, 65536) mod 256;

(18) sequentially taking K, 82,9D and 12 as keys to construct an index tree S-K-82-9D-12;

(19) there are no other data blocks under the current index, and data block 3 is the only block.
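The sequence of Example 2 can be traced with the short Python sketch below. The example does not state the block lengths, the full block contents, which positions were sampled, or whether data block 2 turns out to be a duplicate, so the sketch makes assumptions on all four points: each block is a hypothetical 8-byte block, the sampled positions are the first, last and [L/2]-th bytes as in Example 1, and data block 2 is taken to be an exact copy of data block 1. The names `path_of` and `trace` are illustrative, and fingerprints are recomputed on every conflict for brevity rather than cached.

```python
import hashlib


def path_of(block: bytes) -> str:
    """Index path S-K-A-B-C with K and the sampled bytes written in hexadecimal."""
    length = len(block)
    k = min(length, 65536) % 256
    a, b, c = block[0], block[-1], block[max(length // 2, 1) - 1]
    return "S-{:02X}-{:02X}-{:02X}-{:02X}".format(k, a, b, c)


def trace(blocks):
    index = {}   # path string -> blocks stored under that path
    log = []
    for block in blocks:
        path = path_of(block)
        bucket = index.setdefault(path, [])
        if not bucket:
            status = "unique"                      # steps (6) and (19)
        else:
            fp = hashlib.sha256(block).digest()    # steps (12) and (13)
            stored = {hashlib.sha256(b).digest() for b in bucket}
            status = "duplicate" if fp in stored else "non-duplicate"
        bucket.append(block)                       # step (14)
        log.append((path, status))
    return log


# Hypothetical block contents chosen only to reproduce the byte values in the example.
block1 = bytes([0x1D, 0x00, 0x00, 0x50, 0x00, 0x00, 0x00, 0x34])
block2 = bytes([0x1D, 0x00, 0x00, 0x50, 0x00, 0x00, 0x00, 0x34])
block3 = bytes([0x82, 0x00, 0x00, 0x12, 0x00, 0x00, 0x00, 0x9D])
log = trace([block1, block2, block3])
```

With these assumed 8-byte blocks, K is 08 for all three, so data blocks 1 and 2 share the path S-08-1D-34-50 while data block 3 lands on S-08-82-9D-12 and needs no fingerprint.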

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. A variable-length data block oriented quick retrieval method is characterized by comprising the following steps:

step S1, reading a variable-length data block group to be retrieved;

step S2, extracting the length and partial position byte value of the data block in the data block group;

2. The method for fast searching for variable-length data blocks according to claim 1,

the step S2 specifically includes:

s2.1, inputting a data block of the variable-length data block group;

and S2.2, if the current input is null, outputting a judgment result and terminating the retrieval, otherwise, extracting the length L of the current data block and the byte values A0, A1, A2 and … of the partial positions.

3. The method for fast retrieving data blocks of variable length according to claim 1, wherein the step S3 specifically includes:

4. The method for fast retrieving data blocks of variable length according to claim 1, wherein the step S4 specifically includes:

5. The variable-length data block oriented fast retrieval method as claimed in claim 1, wherein the data block fingerprint is a secure hash function value of the data block.

6. The method for variable-length block oriented fast retrieval as claimed in claim 1, wherein the extracted partial-position byte values are defined as the 1st byte value, the last byte value, the [L/2]-th byte value, and the 2ⁿ-th byte value, wherein L represents the data block length, n is a natural number, and 2ⁿ ≤ L.