CN113495901B - Quick retrieval method for variable-length data blocks - Google Patents

Info

Publication number
CN113495901B
Authority
CN
China
Prior art keywords
data block
length
fingerprints
current
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110424974.2A
Other languages
Chinese (zh)
Other versions
CN113495901A (en)
Inventor
徐振楠
吕鑫
吴涛
高晟凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Huaneng Lancang River Hydropower Co Ltd
NetsUnion Clearing Corp
Original Assignee
Hohai University HHU
Huaneng Lancang River Hydropower Co Ltd
NetsUnion Clearing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU, Huaneng Lancang River Hydropower Co Ltd, NetsUnion Clearing Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a quick retrieval method for variable-length data blocks, comprising: extracting the length of each data block and the byte values at selected positions; constructing an index tree; and calculating and comparing data block fingerprints. By building the index from the length and sampled bytes of each variable-length data block, the method checks for index collisions first and calculates fingerprints only afterwards, which reduces the number of fingerprint calculations and improves retrieval efficiency.

Description

Quick retrieval method for variable-length data blocks
Technical Field
The application relates to the technical field of data storage, in particular to a quick retrieval method for variable-length data blocks.
Background
An important indicator of data deduplication capability is overhead, which mainly comprises the cost of fingerprint calculation and the cost of fingerprint retrieval. As a storage system ages, the volume of stored data keeps growing; fingerprint retrieval and comparison then consume a large share of computing resources and increase disk I/O, which lowers retrieval efficiency. Optimizing retrieval efficiency is therefore a principal means of reducing system overhead.
Retrieval efficiency is mainly optimized by: fast membership testing with a Bloom filter; preloading data by exploiting data locality; building hierarchical indexes from data similarity; tuning the index structure to the storage medium; and tuning the index structure to the type of data stored.
At present, duplicate data is mainly detected as follows: the fingerprint of a data block is calculated first and then looked up in the fingerprint index table to decide whether the block is a duplicate. However, data block fingerprints are generally produced by secure hash functions such as SHA-256 or SHA-3, so calculating the fingerprint first costs considerable computation time.
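For comparison, the conventional fingerprint-first approach described above can be sketched as follows. This is a minimal illustration, not code from the application: the function name `fingerprint_first_dedup`, the in-memory `set` standing in for the fingerprint index table, and the choice of SHA-256 via Python's `hashlib` are all assumptions.

```python
import hashlib


def fingerprint_first_dedup(blocks):
    """Baseline sketch: hash every block, then look the fingerprint up.

    Returns (results, hash_calls), where results[i] is True when
    blocks[i] duplicates an earlier block.
    """
    seen = set()      # assumed in-memory stand-in for the fingerprint index table
    results = []
    hash_calls = 0
    for block in blocks:
        fp = hashlib.sha256(block).digest()  # computed for every block, unique or not
        hash_calls += 1
        results.append(fp in seen)
        seen.add(fp)
    return results, hash_calls
```

In this baseline the number of hash computations always equals the number of blocks, which is the cost the application sets out to reduce.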
Disclosure of Invention
The application aims to provide a quick retrieval method for variable-length data blocks that builds an index from the length and sampled bytes of each block, so that index collisions are checked first and fingerprints are calculated only afterwards, thereby reducing fingerprint calculations and improving retrieval efficiency.
The technical scheme adopted by the application is as follows:
A quick retrieval method for variable-length data blocks comprises the following steps:
step S1, reading a group of variable-length data blocks to be retrieved;
step S2.1, inputting one data block of the variable-length data block group;
step S2.2, if the current input is empty, outputting the judgment result and stopping the retrieval; otherwise extracting the length L of the current data block and the byte values A0, A1, A2, … at selected positions;
step S3, mapping the length of the current data block to the interval [0, 255], and constructing an index tree with the mapped value as the first child node of the index tree;
step S4, calculating the data block fingerprint, comparing fingerprints, adding the data block information as a child node of the current node, and returning to step S1.
Further, step S3 specifically includes:
step S3.1, calculating K = min(L, 65536) mod 256, where K is the value that maps the current data block length to the interval [0, 255], L is the data block length, min(L, 65536) takes the smaller of L and 65536 so that data blocks whose length exceeds 64 KB fall under the same child node, and mod is the modulo operation;
step S3.2, taking K, A0, A1, A2, … in order as keys to construct the index tree S-K-A0-A1-A2-…, where S is the root node of the index tree.
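Steps S3.1 and S3.2 can be sketched as below. The nested-dictionary representation of the index tree, the helper names `length_key` and `descend`, and the `"blocks"` leaf key are illustrative assumptions; the application does not prescribe a concrete tree implementation.

```python
def length_key(length: int) -> int:
    """Step S3.1: K = min(L, 65536) mod 256, a value in [0, 255]."""
    return min(length, 65536) % 256


def descend(root: dict, length: int, sampled: list) -> list:
    """Step S3.2: walk (creating nodes as needed) the path S-K-A0-A1-...

    `root` plays the role of S. Returns the list of block records
    stored under the final node of the path.
    """
    node = root
    for key in (length_key(length), *sampled):
        node = node.setdefault(key, {})
    # Integer keys are child nodes; the string key "blocks" holds the records.
    return node.setdefault("blocks", [])
```

One consequence of the formula worth noting: since 65536 is a multiple of 256, every block of 64 KB or more maps to K = 0, the same value as shorter blocks whose length is a multiple of 256 (for example 512 bytes).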
Further, step S4 specifically includes:
step S4.1, if no other data block exists under the current index, the currently detected data block is a unique block, and the method returns to step S2; otherwise it proceeds to step S4.2;
step S4.2, calculating the fingerprints of the data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
step S4.3, comparing the fingerprint of the currently detected data block with the fingerprints of the other data blocks under the index; if a matching fingerprint exists, the currently detected data block is a duplicate block, otherwise it is a non-duplicate block;
step S4.4, adding the data block information as a child node of the current node, and returning to step S2.
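The deferred fingerprinting of steps S4.1 to S4.4 might look like the following sketch for a single index entry. The `Entry` record, the `check_and_add` name, keeping the raw block bytes in memory so they can be hashed later, and the use of SHA-256 are assumptions made for illustration; a real system would more likely store a reference to the block's location.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    data: bytes                          # assumed: raw block kept for later hashing
    fingerprint: Optional[bytes] = None  # None until a collision forces hashing


def check_and_add(bucket: List[Entry], block: bytes) -> bool:
    """Return True if `block` duplicates an entry already under this index."""
    if not bucket:
        # S4.1: nothing else under this index, so the block is unique; no hashing.
        bucket.append(Entry(block))
        return False
    for entry in bucket:
        # S4.2: fingerprint stored blocks that have not been hashed yet.
        if entry.fingerprint is None:
            entry.fingerprint = hashlib.sha256(entry.data).digest()
    fp = hashlib.sha256(block).digest()
    # S4.3: the block is a duplicate if any stored fingerprint matches.
    duplicate = any(entry.fingerprint == fp for entry in bucket)
    # S4.4: record the block under the current node.
    bucket.append(Entry(block, fp))
    return duplicate
```

A block that never shares its index entry with another block is never hashed at all, which is where the saving over the fingerprint-first approach comes from.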
Further, the data block fingerprint is a secure hash function value of the data block.
Further, the extracted byte values at selected positions are defined as the 1st byte value, the last byte value, the [L/2]-th byte value, and the 2^n-th byte value, where L is the data block length, n is a natural number, and 2^n ≤ L.
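The position rule above leaves some details open, so the sketch below states its own reading explicitly: positions are 1-based, [L/2] is taken as integer (floor) division and clamped to at least 1 so that a 1-byte block still has a valid middle position, and the caller chooses which exponents n to sample. The names `sample_positions` and `sample_bytes` are hypothetical.

```python
def sample_positions(length: int, exponents=()) -> list:
    """Return 1-based positions: first, last, [L/2], then 2**n for each chosen n.

    Assumptions: [L/2] is floor division clamped to >= 1; exponents with
    2**n > length are skipped, matching the condition 2^n <= L.
    """
    if length <= 0:
        raise ValueError("block must be non-empty")
    positions = [1, length, max(1, length // 2)]
    positions += [2 ** n for n in exponents if 2 ** n <= length]
    return positions


def sample_bytes(block: bytes, exponents=()) -> list:
    """Extract the byte values at the sampled positions of `block`."""
    return [block[p - 1] for p in sample_positions(len(block), exponents)]
```

For a 10-byte block and exponents (1, 3, 4), the sampled positions are 1, 10, 5, 2 and 8; the exponent 4 is dropped because 16 exceeds the block length.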
Compared with the prior art, the application has the following beneficial effects:
(1) The application uses block size as one component of the index, so it can be applied to content-based chunking methods;
(2) The index is built from block content, and fingerprints are calculated and compared only when more than one block falls under the same index entry, which reduces the number of fingerprint calculations and improves retrieval efficiency.
Drawings
FIG. 1 is a flow chart of the present application;
FIG. 2 is a schematic diagram showing the specific steps of example 2.
Detailed Description
The application is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.
Example 1
As shown in FIG. 1, a quick retrieval method for variable-length data blocks proceeds as follows:
(1) Input a group of blocks to be checked for duplicates;
(2) Input one block. If the current input is empty, terminate the retrieval; otherwise go to step (3);
(3) Extract the length L of the block, its first byte A, its last byte B, and its [L/2]-th byte C;
(4) Calculate K = min(L, 65536) mod 256;
(5) Take K, A, B and C in order as nodes to construct the index tree S-K-A-B-C, where S is the root node of the index tree;
(6) If no other block exists under the current index, the currently detected block is a unique block; return to step (2). Otherwise go to the next step;
(7) Calculate the fingerprints of the data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
(8) Compare the fingerprint of the currently detected data block with the fingerprints of the other blocks under the index; if a matching fingerprint exists, the currently detected block is a duplicate block, otherwise it is a non-duplicate block; return to step (2).
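Steps (1) to (8) of Example 1 can be put together in one short sketch that also counts hash computations. Several choices here are assumptions rather than part of the application: the index tree is flattened into a dictionary keyed by the tuple (K, A, B, C), byte positions are 1-based with [L/2] floored and clamped to 1, blocks are assumed non-empty, and SHA-256 is used as the fingerprint.

```python
import hashlib
from collections import defaultdict


def dedup_index_first(blocks):
    """Index-first duplicate detection following Example 1.

    Returns (results, hash_calls), where results[i] is True when
    blocks[i] duplicates an earlier block.
    """
    index = defaultdict(list)  # (K, A, B, C) -> list of [data, fingerprint or None]
    results, hash_calls = [], 0
    for block in blocks:
        L = len(block)                                  # step (3)
        key = (min(L, 65536) % 256,                     # step (4): K
               block[0], block[-1],                     # A and B
               block[max(1, L // 2) - 1])               # C
        bucket = index[key]                             # step (5)
        if not bucket:                                  # step (6): unique, no hashing
            bucket.append([block, None])
            results.append(False)
            continue
        for entry in bucket:                            # step (7)
            if entry[1] is None:
                entry[1] = hashlib.sha256(entry[0]).digest()
                hash_calls += 1
        fp = hashlib.sha256(block).digest()
        hash_calls += 1
        results.append(any(entry[1] == fp for entry in bucket))  # step (8)
        bucket.append([block, fp])
    return results, hash_calls
```

For the four blocks `b"hello"`, `b"world"`, `b"hello"`, `b"hallo"`, only the third block collides with an existing index entry, so two hashes are computed in total instead of four.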
Example 2
As shown in FIG. 2, a quick retrieval method for variable-length data blocks:
(1) Reading the group of variable-length data blocks to be retrieved;
(2) Inputting data block 1 of the variable-length data block group;
(3) Extracting the length L of data block 1 and the byte values 1D, 34, 50 at the selected positions;
(4) Mapping the length of data block 1 to the interval [0, 255] by calculating K = min(L, 65536) mod 256, where K is the value that maps the current data block length to the interval [0, 255], L is the data block length, min(L, 65536) takes the smaller of L and 65536 so that data blocks whose length exceeds 64 KB fall under the same child node, and mod is the modulo operation;
(5) Taking K, 1D, 34 and 50 in order as keys to construct the index tree S-K-1D-34-50, where S is the root node of the index tree;
(6) No other data block exists under the current index, so data block 1 is a unique block;
(7) Inputting data block 2 of the variable-length data block group;
(8) Extracting the length L of data block 2 and the byte values 1D, 34, 50 at the selected positions;
(9) Mapping the length of data block 2 to the interval [0, 255] by calculating K = min(L, 65536) mod 256;
(10) Taking K, 1D, 34 and 50 in order as keys to construct the index tree S-K-1D-34-50;
(11) Another data block already exists under the current index, so data block 2 is not a unique block and fingerprints must be compared;
(12) Calculating the fingerprints of the data blocks under the index of data block 2 whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
(13) Comparing the fingerprint of data block 2 with the fingerprints of the other data blocks under the index; if a matching fingerprint exists, the currently detected data block is a duplicate block, otherwise it is a non-duplicate block;
(14) Adding the information of data block 2 as a child node of the current node;
(15) Inputting data block 3 of the variable-length data block group;
(16) Extracting the length L of data block 3 and the byte values 82, 9D, 12 at the selected positions;
(17) Mapping the length of data block 3 to the interval [0, 255] by calculating K = min(L, 65536) mod 256;
(18) Taking K, 82, 9D and 12 in order as keys to construct the index tree S-K-82-9D-12;
(19) No other data block exists under the current index, so data block 3 is a unique block.
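The index paths in Example 2 can be reproduced with concrete data. The example does not give the block lengths or say which positions the three byte values come from, so the sketch below assumes 8-byte blocks and the first, last and [L/2]-th positions used in Example 1; the block contents and the names `index_key` and `trace` are made up for illustration.

```python
def index_key(block: bytes) -> tuple:
    """Return (K, first byte, last byte, [L/2]-th byte) for a non-empty block."""
    L = len(block)
    return (min(L, 65536) % 256, block[0], block[-1], block[max(1, L // 2) - 1])


def trace(blocks):
    """Return, per block, its key and whether that key was already occupied."""
    seen, out = set(), []
    for block in blocks:
        key = index_key(block)
        out.append((key, key in seen))  # True means fingerprints must be compared
        seen.add(key)
    return out


# Hypothetical 8-byte blocks: bytes 1, 4 and 8 carry the sampled values.
block1 = bytes([0x1D, 0x00, 0x00, 0x50, 0x00, 0x00, 0x00, 0x34])
block2 = bytes([0x1D, 0xFF, 0xFF, 0x50, 0xFF, 0xFF, 0xFF, 0x34])
block3 = bytes([0x82, 0x00, 0x00, 0x12, 0x00, 0x00, 0x00, 0x9D])

for key, collided in trace([block1, block2, block3]):
    path = "S-" + "-".join(f"{k:02X}" for k in key)
    print(path, "compare fingerprints" if collided else "unique, no hashing")
```

With these inputs, blocks 1 and 2 share the path for K = 8 followed by 1D, 34, 50 even though their contents differ, so only block 2 triggers fingerprint calculation, while block 3 lands on its own path and is accepted as unique without hashing.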
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present application, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and variations should also be regarded as being within the scope of the application.

Claims (2)

1. A quick retrieval method for variable-length data blocks, characterized by comprising the following steps:
step S1, reading a group of variable-length data blocks to be retrieved;
step S2, extracting the length of the data blocks in the data block group and the byte values at selected positions;
the extracted byte values at selected positions are defined as the 1st byte value, the last byte value, the [L/2]-th byte value, and the 2^n-th byte value, where L is the data block length, n is a natural number, and 2^n ≤ L;
The step S2 specifically includes:
step S2.1, inputting one data block of the variable-length data block group;
step S2.2, if the current input is empty, outputting the judgment result and stopping the retrieval; otherwise extracting the length L of the current data block and the byte values A0, A1, A2, … at selected positions;
step S3, mapping the length of the current data block to the interval [0, 255], and constructing an index tree with the mapped value as the first child node of the index tree;
the step S3 specifically includes:
step S3.1, calculating K = min(L, 65536) mod 256, where K is the value that maps the current data block length to the interval [0, 255], L is the data block length, min(L, 65536) takes the smaller of L and 65536 so that data blocks whose length exceeds 64 KB fall under the same child node, and mod is the modulo operation;
step S3.2, taking K, A0, A1, A2, … in order as keys to construct the index tree S-K-A0-A1-A2-…, where S is the root node of the index tree;
step S4, calculating the data block fingerprint, comparing fingerprints, adding the data block information as a child node of the current node, and returning to step S1;
the step S4 specifically includes:
step S4.1, if no other data block exists under the current index, the currently detected data block is a unique block, and the method returns to step S2; otherwise it proceeds to step S4.2;
step S4.2, calculating the fingerprints of the data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
step S4.3, comparing the fingerprint of the currently detected data block with the fingerprints of the other data blocks under the index; if a matching fingerprint exists, the currently detected data block is a duplicate block, otherwise it is a non-duplicate block;
step S4.4, adding the data block information as a child node of the current node, and returning to step S2.
2. The quick retrieval method for variable-length data blocks according to claim 1, wherein the data block fingerprint is a secure hash function value of the data block.
CN202110424974.2A 2021-04-20 2021-04-20 Quick retrieval method for variable-length data blocks Active CN113495901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110424974.2A CN113495901B (en) 2021-04-20 2021-04-20 Quick retrieval method for variable-length data blocks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110424974.2A CN113495901B (en) 2021-04-20 2021-04-20 Quick retrieval method for variable-length data blocks

Publications (2)

Publication Number Publication Date
CN113495901A CN113495901A (en) 2021-10-12
CN113495901B true CN113495901B (en) 2023-10-13

Family

ID=77997665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110424974.2A Active CN113495901B (en) 2021-04-20 2021-04-20 Quick retrieval method for variable-length data blocks

Country Status (1)

Country Link
CN (1) CN113495901B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118043799A (en) * 2021-12-13 2024-05-14 华为技术有限公司 Data management method and device in storage system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289518A (en) * 2011-09-13 2011-12-21 盛乐信息技术(上海)有限公司 Method and system for updating audio fingerprint search library
CN103959256A (en) * 2011-11-28 2014-07-30 国际商业机器公司 Fingerprint-based data deduplication
CN105338297A (en) * 2014-08-11 2016-02-17 杭州海康威视系统技术有限公司 Video data storage and playback system, device and method
CN111091118A (en) * 2019-12-31 2020-05-01 北京奇艺世纪科技有限公司 Image recognition method and device, electronic equipment and storage medium
CN112347272A (en) * 2020-09-18 2021-02-09 国家计算机网络与信息安全管理中心 Streaming matching method and device based on audio and video dynamic characteristics
CN112470140A (en) * 2018-06-06 2021-03-09 吴英全 Block-based deduplication

Also Published As

Publication number Publication date
CN113495901A (en) 2021-10-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant