CN113495901A - Variable-length data block oriented quick retrieval method - Google Patents

Variable-length data block oriented quick retrieval method

Info

Publication number
CN113495901A
Authority
CN
China
Prior art keywords
data block
length
variable
block
current
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110424974.2A
Other languages
Chinese (zh)
Other versions
CN113495901B (en)
Inventor
徐振楠
吕鑫
吴涛
高晟凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Huaneng Lancang River Hydropower Co Ltd
NetsUnion Clearing Corp
Original Assignee
Hohai University HHU
Huaneng Lancang River Hydropower Co Ltd
NetsUnion Clearing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-04-20
Filing date
2021-04-20
Publication date
2021-10-12
Application filed by Hohai University HHU, Huaneng Lancang River Hydropower Co Ltd, NetsUnion Clearing Corp
Priority to CN202110424974.2A
Publication of CN113495901A
Application granted
Publication of CN113495901B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a quick retrieval method for variable-length data blocks. The method comprises extracting the length of a data block and the byte values at selected positions, constructing an index tree, and calculating and comparing data block fingerprints. By building the index from the length and selected bytes of each variable-length data block, the method first checks for an index conflict and calculates fingerprints only afterwards, which reduces the number of fingerprint calculations and improves retrieval efficiency.

Description

Variable-length data block oriented quick retrieval method
Technical Field
The invention relates to the technical field of data storage, in particular to a quick retrieval method for variable-length data blocks.
Background
An important measure of deduplication capability is system overhead, which mainly comprises the overhead of fingerprint calculation and the overhead of fingerprint retrieval. In a storage system, the volume of stored data grows over time; fingerprint retrieval and comparison then not only consume a large amount of computing resources but also increase disk I/O, which reduces retrieval efficiency. Optimizing retrieval efficiency is therefore a principal means of reducing system overhead.
Retrieval efficiency is mainly optimized by means such as rapid membership judgment with Bloom filters, preloading based on data locality, hierarchical indexes built on data similarity, index structures optimized for different storage media, and index structures optimized for the type of stored data.
At present, duplicate data is mainly retrieved as follows: the fingerprint of a data block is calculated first, and the fingerprint is then compared against a fingerprint index table to judge whether the data block is a duplicate. However, data block fingerprints are generally produced by secure hash functions such as SHA256 and SHA-3, so calculating the fingerprint up front consumes a large amount of computation time.
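For comparison, the conventional fingerprint-first flow described above can be sketched as follows. This is a minimal illustration rather than any particular system's implementation: the function name `find_duplicates_baseline` and the in-memory `set` standing in for the fingerprint index table are assumptions, and SHA-256 is used as the secure hash.

```python
import hashlib


def find_duplicates_baseline(blocks):
    """Fingerprint-first deduplication: hash every block, then look it up."""
    seen = set()          # assumed stand-in for the fingerprint index table
    results = []
    for data in blocks:
        fp = hashlib.sha256(data).digest()   # computed for every block
        results.append(fp in seen)
        seen.add(fp)
    return results


print(find_duplicates_baseline([b"abc", b"xyz", b"abc"]))  # [False, False, True]
```

Every block costs one hash computation in this flow, even when nothing already stored could match it.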
Disclosure of Invention
The invention aims to provide a quick retrieval method for variable-length data blocks that builds an index from the length and selected bytes of each variable-length data block, so that index conflicts are checked first and fingerprints are calculated afterwards, thereby reducing the number of fingerprint calculations and improving retrieval efficiency.
The technical scheme adopted by the invention is as follows:
A quick retrieval method for variable-length data blocks comprises the following steps:
Step S1, reading a variable-length data block group to be retrieved;
Step S2.1, inputting a data block of the variable-length data block group;
Step S2.2, if the current input is null, outputting the judgment result and terminating the retrieval; otherwise, extracting the length L of the current data block and the byte values A0, A1, A2, … at partial positions;
Step S3, mapping the length of the current data block to the interval [0,255], and constructing an index tree with the mapped value as the first child node of the index tree;
Step S4, calculating the fingerprint of the data block, comparing fingerprints, adding the data block information as a child node of the current node, and returning to step S1.
Further, step S3 specifically includes:
Step S3.1, calculating K = min(L, 65536) mod 256, where K denotes the value obtained by mapping the current data block length to the interval [0,255], L denotes the data block length, min(L, 65536) takes the smaller of L and 65536 so that all data blocks whose length exceeds 64KB fall under the same child node, and mod denotes the modulo operation;
Step S3.2, sequentially using K, A0, A1, A2, … as keys to construct an index tree S-K-A0-A1-A2-…, where S represents the root node of the index tree.
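The length mapping and key sequence of step S3 can be illustrated with a short sketch. The function names `length_key` and `index_path` are hypothetical, and the sampled byte values passed in are arbitrary examples.

```python
from __future__ import annotations


def length_key(length: int) -> int:
    """Map a block length to [0, 255] as K = min(L, 65536) mod 256."""
    return min(length, 65536) % 256


def index_path(length: int, sampled: list[int]) -> tuple[int, ...]:
    """Return the keys K, A0, A1, ... that are followed from the root S."""
    return (length_key(length), *sampled)


print(length_key(300))         # 300 mod 256 = 44
print(length_key(65536))       # 65536 mod 256 = 0
print(length_key(1_000_000))   # capped at 65536 first, so also 0
print(index_path(300, [0x1D, 0x34, 0x50]))
```

Because of the modulo, lengths that differ by a multiple of 256 (for example 44 and 300) share the same K, so K acts as a coarse filter rather than a unique length identifier.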
Further, step S4 specifically includes:
Step S4.1, if no other data block exists under the current index, the currently detected data block is the only block, and the method returns to step S2; otherwise, go to step S4.2;
Step S4.2, calculating the fingerprints of the data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
Step S4.3, comparing the fingerprint of the currently detected data block with the fingerprints of the other data blocks under the index; if the fingerprint already exists, the currently detected data block is a repeated block, and otherwise it is a non-repeated block;
Step S4.4, adding the data block information as a child node of the current node, and returning to step S2.
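Steps S4.1 to S4.4 for a single index position might look like the sketch below. It rests on several assumptions that the text does not specify: the `Entry` class and `check_block` function are hypothetical names, each entry keeps the raw block bytes so that a fingerprint can be computed later (a real store would more likely keep an address), SHA-256 serves as the fingerprint, and only non-repeated blocks are added to the bucket.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass
class Entry:
    data: bytes
    fingerprint: bytes | None = None   # filled in lazily


def fingerprint(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def check_block(bucket: list[Entry], data: bytes) -> bool:
    """Apply S4.1 to S4.4 to one index bucket; True means data is a repeat."""
    if not bucket:                        # S4.1: only block under this index
        bucket.append(Entry(data))        # stored without a fingerprint
        return False
    for entry in bucket:                  # S4.2: fill in missing fingerprints
        if entry.fingerprint is None:
            entry.fingerprint = fingerprint(entry.data)
    fp = fingerprint(data)
    duplicate = any(e.fingerprint == fp for e in bucket)   # S4.3
    if not duplicate:                     # S4.4 (assumption: skip repeats)
        bucket.append(Entry(data, fp))
    return duplicate


bucket: list[Entry] = []
print(check_block(bucket, b"alpha"))   # False, and no hash was computed
print(check_block(bucket, b"alpha"))   # True
print(check_block(bucket, b"alphb"))   # False
```

The first block to reach an index position is never hashed unless a second block later lands on the same position.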
Further, the data block fingerprint is a secure hash function value of the data block.
Further, the extracted partial position byte values are defined as the 1st byte value, the last byte value, the [L/2]-th byte value, and the 2^n-th byte value, where L represents the data block length, n is a natural number, and 2^n ≤ L.
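The sampling positions defined above can be enumerated with a small sketch. Several details are assumptions, since the text leaves them open: positions are treated as 1-based, [L/2] is taken as integer division and clamped to at least 1, n starts at 1, every power of two up to L is used, and repeated positions are not removed. The names `sample_positions` and `sample_bytes` are hypothetical.

```python
from __future__ import annotations


def sample_positions(length: int) -> list[int]:
    """1-based positions: first, last, [L/2], then 2^n (n >= 1) with 2^n <= L."""
    if length <= 0:
        return []
    positions = [1, length, max(1, length // 2)]
    power = 2
    while power <= length:
        positions.append(power)
        power *= 2
    return positions


def sample_bytes(block: bytes) -> list[int]:
    """Byte values at the sampled positions (converted to 0-based indexing)."""
    return [block[p - 1] for p in sample_positions(len(block))]


block = bytes(range(10))               # the byte at position p has value p - 1
print(sample_positions(len(block)))    # [1, 10, 5, 2, 4, 8]
print(sample_bytes(block))             # [0, 9, 4, 1, 3, 7]
```

Only a handful of bytes are read per block, which is far cheaper than hashing the whole block.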
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention introduces the block size as one of the index keys, and can therefore be applied to content-based chunking methods;
(2) the index is constructed from the block content, and fingerprints need to be calculated and compared only when several blocks exist under the same index, which reduces the number of fingerprint calculations and improves retrieval efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram showing the detailed steps of example 2.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1
As shown in fig. 1, a fast retrieval method for variable-length data blocks:
(1) inputting a group of blocks that need duplicate detection;
(2) inputting a block. If the current input is null, terminating the retrieval; otherwise, going to step (3);
(3) extracting the length L, the first byte A, the last byte B and the [L/2]-th byte C of the block;
(4) calculating K = min(L, 65536) mod 256;
(5) sequentially taking K, A, B and C as nodes to construct an index tree S-K-A-B-C, where S represents the root node of the index tree;
(6) if no other block exists under the current index, the currently detected block is the only block, and the method returns to step (2); otherwise, entering the next step;
(7) calculating the fingerprints of the data blocks under the current index whose fingerprints have not been calculated, and the fingerprint of the currently detected data block;
(8) comparing the fingerprint of the currently detected data block with the fingerprints of the other blocks under the index; if the fingerprint already exists, the currently detected block is a repeated block, otherwise it is a non-repeated block; returning to step (2).
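The eight steps of this example can be put together in one end-to-end sketch. It is an illustration under stated assumptions, not the patented implementation: the `Node` class and `find_duplicates` function are hypothetical names, the tree lives in memory, blocks are assumed non-empty, the [L/2]-th byte is read with 1-based indexing, SHA-256 is the fingerprint, and only non-repeated blocks are stored under a leaf.

```python
from __future__ import annotations

import hashlib


class Node:
    """Index tree node: children keyed by K or a byte value."""

    def __init__(self) -> None:
        self.children: dict[int, Node] = {}
        self.blocks: list[list] = []   # [data, fingerprint or None] pairs


def find_duplicates(blocks: list[bytes]) -> list[bool]:
    root = Node()                                     # root node S
    results = []
    for data in blocks:                               # step (2)
        if not data:
            raise ValueError("blocks are assumed to be non-empty")
        length = len(data)
        a, b = data[0], data[-1]                      # step (3)
        c = data[max(length // 2, 1) - 1]
        k = min(length, 65536) % 256                  # step (4)
        node = root
        for key in (k, a, b, c):                      # step (5): S-K-A-B-C
            node = node.children.setdefault(key, Node())
        if not node.blocks:                           # step (6): unique so far
            node.blocks.append([data, None])
            results.append(False)
            continue
        for item in node.blocks:                      # step (7): lazy hashing
            if item[1] is None:
                item[1] = hashlib.sha256(item[0]).digest()
        fp = hashlib.sha256(data).digest()
        duplicate = any(item[1] == fp for item in node.blocks)   # step (8)
        if not duplicate:
            node.blocks.append([data, fp])
        results.append(duplicate)
    return results


print(find_duplicates([b"hello world", b"hello world", b"hello_world", b"short"]))
# [False, True, False, False]
```

In this run the third block shares K, A, B and C with the first two, so it reaches the same leaf and is distinguished only by its fingerprint, while the fourth block has a different length and is never hashed.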
Example 2
As shown in fig. 2, a fast retrieval method for variable-length data blocks:
(1) reading a variable-length data block group to be retrieved;
(2) inputting data block 1 of the variable-length data block group;
(3) extracting the length L of data block 1 and the byte values 1D, 34 and 50 at partial positions;
(4) mapping the length of data block 1 to the interval [0,255] by calculating K = min(L, 65536) mod 256, where K represents the value obtained by mapping the current data block length to the interval [0,255], L represents the data block length, min(L, 65536) takes the smaller of L and 65536 so that all data blocks whose length exceeds 64KB fall under the same child node, and mod represents the modulo operation;
(5) sequentially taking K, 1D, 34 and 50 as keys to construct an index tree S-K-1D-34-50, where S represents the root node of the index tree;
(6) no other data block exists under the current index, so data block 1 is a unique block;
(7) inputting data block 2 of the variable-length data block group;
(8) extracting the length L of data block 2 and the byte values 1D, 34 and 50 at partial positions;
(9) mapping the length of data block 2 to the interval [0,255] by calculating K = min(L, 65536) mod 256;
(10) sequentially taking K, 1D, 34 and 50 as keys to construct an index tree S-K-1D-34-50;
(11) another data block exists under the current index, so data block 2 is not a unique block and its fingerprint needs to be compared;
(12) calculating the fingerprints of the data blocks under the index of data block 2 whose fingerprints have not been calculated, and the fingerprint of the currently detected data block;
(13) comparing the fingerprint of data block 2 with the fingerprints of the other data blocks under the index; if the fingerprint already exists, the currently detected data block is a repeated block, and otherwise it is a non-repeated block;
(14) adding the data block 2 information as a child node of the current node;
(15) inputting data block 3 of the variable-length data block group;
(16) extracting the length L of data block 3 and the byte values 82, 9D and 12 at partial positions;
(17) mapping the length of data block 3 to the interval [0,255] by calculating K = min(L, 65536) mod 256;
(18) sequentially taking K, 82, 9D and 12 as keys to construct an index tree S-K-82-9D-12;
(19) no other data block exists under the current index, so data block 3 is a unique block.
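The index paths in this example can be traced with a short sketch. The block contents below are invented: the example gives only the sampled byte values, so the 8-byte length, the filler bytes, and the reading of each triple as first byte, last byte and [L/2]-th byte (the order used in Example 1) are all assumptions, as is the helper name `index_path`.

```python
from collections import defaultdict


def index_path(data: bytes) -> tuple:
    """Keys K, first byte, last byte, [L/2]-th byte (1-based) for one block."""
    length = len(data)
    k = min(length, 65536) % 256
    return (k, data[0], data[-1], data[max(length // 2, 1) - 1])


# Hypothetical 8-byte blocks; only the sampled bytes follow the example.
blocks = {
    "block 1": bytes([0x1D, 0, 0, 0x50, 0, 0, 0, 0x34]),
    "block 2": bytes([0x1D, 1, 1, 0x50, 1, 1, 1, 0x34]),
    "block 3": bytes([0x82, 0, 0, 0x12, 0, 0, 0, 0x9D]),
}

groups = defaultdict(list)
for name, data in blocks.items():
    groups[index_path(data)].append(name)

for keys, names in groups.items():
    label = "S-" + "-".join(f"{v:02X}" for v in keys)
    action = "fingerprint and compare" if len(names) > 1 else "unique, no fingerprint"
    print(label, names, action)
```

With these assumed blocks, blocks 1 and 2 share the path S-08-1D-34-50 and must be fingerprinted, while block 3 sits alone under S-08-82-9D-12. The grouping is a batch view of what the method does incrementally: block 1 is stored without a fingerprint and is hashed only once block 2 arrives at the same position.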
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (6)

1. A variable-length data block oriented quick retrieval method is characterized by comprising the following steps:
step S1, reading a variable-length data block group to be retrieved;
step S2, extracting the length and partial position byte value of the data block in the data block group;
step S3, mapping the length of the current data block to the interval [0,255], and constructing an index tree by taking the mapped value as the first child node of the index tree;
and step S4, calculating the fingerprint of the data block, comparing the fingerprint, adding the data block information as the child node of the current node, and returning to the step S1.
2. The method for fast searching for variable-length data blocks according to claim 1,
the step S2 specifically includes:
step S2.1, inputting a data block of the variable-length data block group;
and step S2.2, if the current input is null, outputting a judgment result and terminating the retrieval, otherwise, extracting the length L of the current data block and the byte values A0, A1, A2, … at partial positions.
3. The method for fast retrieving data blocks of variable length according to claim 1, wherein the step S3 specifically includes:
step S3.1, calculating K = min(L, 65536) mod 256, where K denotes the value obtained by mapping the current data block length to the interval [0,255], L denotes the data block length, min(L, 65536) takes the smaller of L and 65536 so that all data blocks whose length exceeds 64KB fall under the same child node, and mod denotes the modulo operation;
and step S3.2, sequentially using K, A0, A1, A2, … as keys to construct an index tree S-K-A0-A1-A2-…, wherein S represents a root node of the index tree.
4. The method for fast retrieving data blocks of variable length according to claim 1, wherein the step S4 specifically includes:
step S4.1, if no other data block exists under the current index, the currently detected data block is the only block, and the method returns to step S2; otherwise, go to step S4.2;
step S4.2, calculating the fingerprints of the data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
step S4.3, comparing the fingerprint of the currently detected data block with the fingerprints of the other data blocks under the index; if the fingerprint already exists, determining that the currently detected data block is a repeated block, and otherwise determining that it is a non-repeated block;
and step S4.4, adding the data block information as a child node of the current node, and returning to step S2.
5. The variable-length data block oriented fast retrieval method as claimed in claim 1, wherein the data block fingerprint is a secure hash function value of the data block.
6. The method for variable-length block oriented fast retrieval as claimed in claim 1, wherein the extracted partial position byte values are defined as the 1st byte value, the last byte value, the [L/2]-th byte value, and the 2^n-th byte value, wherein L represents the data block length, n is a natural number, and 2^n ≤ L.
CN202110424974.2A 2021-04-20 2021-04-20 Quick retrieval method for variable-length data blocks Active CN113495901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110424974.2A CN113495901B (en) 2021-04-20 2021-04-20 Quick retrieval method for variable-length data blocks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110424974.2A CN113495901B (en) 2021-04-20 2021-04-20 Quick retrieval method for variable-length data blocks

Publications (2)

Publication Number Publication Date
CN113495901A true CN113495901A (en) 2021-10-12
CN113495901B CN113495901B (en) 2023-10-13

Family

ID=77997665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110424974.2A Active CN113495901B (en) 2021-04-20 2021-04-20 Quick retrieval method for variable-length data blocks

Country Status (1)

Country Link
CN (1) CN113495901B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023108360A1 (en) * 2021-12-13 2023-06-22 华为技术有限公司 Method and apparatus for managing data in storage system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289518A (en) * 2011-09-13 2011-12-21 盛乐信息技术(上海)有限公司 Method and system for updating audio fingerprint search library
CN103959256A (en) * 2011-11-28 2014-07-30 国际商业机器公司 Fingerprint-based data deduplication
CN105338297A (en) * 2014-08-11 2016-02-17 杭州海康威视系统技术有限公司 Video data storage and playback system, device and method
CN111091118A (en) * 2019-12-31 2020-05-01 北京奇艺世纪科技有限公司 Image recognition method and device, electronic equipment and storage medium
CN112347272A (en) * 2020-09-18 2021-02-09 国家计算机网络与信息安全管理中心 Streaming matching method and device based on audio and video dynamic characteristics
CN112470140A (en) * 2018-06-06 2021-03-09 吴英全 Block-based deduplication

Also Published As

Publication number Publication date
CN113495901B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN110019218B (en) Data storage and query method and equipment
CN107038206B (en) LSM tree establishing method, LSM tree data reading method and server
WO2018064962A1 (en) Data storage method, electronic device and computer non-volatile storage medium
CN109325032B (en) Index data storage and retrieval method, device and storage medium
CN109376196B (en) Method and device for batch synchronization of redo logs
US10642814B2 (en) Signature-based cache optimization for data preparation
CN102129458A (en) Method and device for storing relational database
JPH02109167A (en) Method and device for retrieving character string
CN106778079A (en) A kind of DNA sequence dna k mer frequency statistics methods based on MapReduce
WO2016043757A1 (en) Data to be backed up in a backup system
CN108205571B (en) Key value data table connection method and device
CN114780502B (en) Database method, system, device and medium based on compressed data direct computation
US20170109389A1 (en) Step editor for data preparation
CN106557571A (en) A kind of data duplicate removal method and device based on K V storage engines
Sirén Burrows-Wheeler transform for terabases
CN102867049A (en) Chinese PINYIN quick word segmentation method based on word search tree
CN108229573B (en) Classification calculation method and device based on decision tree
CN110019205B (en) Data storage and restoration method and device and computer equipment
WO2022199400A1 (en) Method and apparatus for retrieving persistent memory file system metadata, and storage structure
CN113495901B (en) Quick retrieval method for variable-length data blocks
WO2024078122A1 (en) Database table scanning method and apparatus, and device
CN111026736B (en) Data blood margin management method and device and data blood margin analysis method and device
CN103761298A (en) Distributed-architecture-based entity matching method
CN111752954B (en) Large-scale feature data storage method and device
JP5139335B2 (en) Data search device, data search method, and data search program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant