CN113495901B - Quick retrieval method for variable-length data blocks - Google Patents
- Publication number
- CN113495901B (application CN202110424974.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a quick retrieval method for variable-length data blocks, comprising: extracting the length of each data block and the byte values at selected positions; constructing an index tree; and calculating and comparing data block fingerprints. By building the index from the length and sampled bytes of each variable-length data block, the method first searches for index conflicts and calculates fingerprints only afterwards, which reduces the number of fingerprint calculations and improves retrieval efficiency.
Description
Technical Field
The application relates to the technical field of data storage, in particular to a quick retrieval method for variable-length data blocks.
Background
An important indicator of data deduplication capability is overhead, which mainly comprises the overhead of fingerprint calculation and the overhead of fingerprint retrieval. As a storage system ages, the volume of stored data keeps growing; fingerprint retrieval and comparison then consume a large share of computing resources and increase disk I/O, which lowers retrieval efficiency. Optimizing retrieval efficiency is therefore a main means of reducing system overhead.
Retrieval efficiency is mainly optimized by fast membership tests with a Bloom filter, preloading data by exploiting data locality, constructing hierarchical indexes by exploiting data similarity, tailoring index structures to the storage medium, and tailoring index structures to the stored data types.
At present, duplicate data is mainly detected as follows: the fingerprint of the data block is calculated first and then looked up in the fingerprint index table to decide whether the block is a duplicate. However, fingerprints are generally produced by secure hash functions such as SHA-256 or SHA-3, so calculating the fingerprint first for every block takes considerable computation time.
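The fingerprint-first approach described above can be sketched as follows. This is a minimal illustration rather than any particular system's implementation: the function name `dedupe_fingerprint_first`, the in-memory `set` standing in for the fingerprint index table, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib

def dedupe_fingerprint_first(blocks):
    """Baseline sketch: hash every block, then look the digest up in a table."""
    seen = set()            # stands in for the fingerprint index table
    is_duplicate = []       # one flag per input block
    hashes_computed = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()   # computed unconditionally
        hashes_computed += 1
        is_duplicate.append(digest in seen)
        seen.add(digest)
    return is_duplicate, hashes_computed

flags, n_hashes = dedupe_fingerprint_first([b"alpha", b"beta", b"alpha"])
```

Every block costs one hash here, including blocks that could not possibly match anything already stored.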
Disclosure of Invention
The application aims to provide a quick retrieval method for variable-length data blocks that builds an index from the length and sampled bytes of each variable-length data block, so that index conflicts are searched first and fingerprints are calculated afterwards, thereby reducing fingerprint calculations and improving retrieval efficiency.
The technical scheme adopted by the application is as follows:
A quick retrieval method for variable-length data blocks comprises the following steps:
Step S1: read the group of variable-length data blocks to be retrieved;
Step S2: extract the length of each data block in the group and the byte values at selected positions, specifically:
Step S2.1: input one data block of the variable-length data block group;
Step S2.2: if the current input is empty, output the determination result and stop the retrieval; otherwise extract the length L of the current data block and the byte values A0, A1, A2, … at selected positions;
Step S3: map the length of the current data block to the interval [0, 255], and construct the index tree with the mapped value as the first child node of the index tree;
Step S4: calculate the data block fingerprint, compare fingerprints, add the data block information as a child node of the current node, and return to step S1.
Further, step S3 specifically comprises:
Step S3.1: calculate K = min(L, 65536) mod 256, where K is the value obtained by mapping the current data block length to the interval [0, 255], L is the data block length, min(L, 65536) takes the smaller of L and 65536 so that data blocks longer than 64 KB are assigned to the same child node, and mod is the modulo operation;
Step S3.2: taking K, A0, A1, A2, … in turn as keys, construct the index tree S-K-A0-A1-A2-…, where S is the root node of the index tree.
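Step S3 can be sketched as below. The formula for K is taken from step S3.1; representing the index tree as nested dictionaries, the names `length_key` and `insert_path`, and the `"blocks"` leaf list are assumptions made for illustration only.

```python
MAX_LEN = 65536  # 64 KB cap from step S3.1

def length_key(length):
    """K = min(L, 65536) mod 256, an integer in [0, 255]."""
    return min(length, MAX_LEN) % 256

def insert_path(root, length, sampled_bytes):
    """Walk or create the path S-K-A0-A1-... and return the list stored at its end."""
    node = root                                   # root plays the role of S
    for key in (length_key(length), *sampled_bytes):
        node = node.setdefault(key, {})           # create the child if missing
    return node.setdefault("blocks", [])          # hypothetical leaf container

root = {}
leaf = insert_path(root, 300, (0x1D, 0x34, 0x50))
leaf.append("block-1")
```

With this formula a 300-byte block gets K = 44, and every block of 65536 bytes or more gets K = 0, which it shares with shorter blocks whose length is a multiple of 256.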
Further, step S4 specifically comprises:
Step S4.1: if no other data block exists under the current index, the currently detected data block is a unique block, and the method returns to step S2; otherwise proceed to step S4.2;
Step S4.2: calculate the fingerprints of those data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
Step S4.3: compare the fingerprint of the currently detected data block with the fingerprints of the other data blocks under the index; if an identical fingerprint exists, the currently detected data block is a duplicate block; otherwise it is a non-duplicate block;
Step S4.4: add the data block information as a child node of the current node, and return to step S2.
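A minimal sketch of steps S4.1 to S4.4 for a single index position follows. The `Entry` class, the `check_and_add` function, and the use of SHA-256 are hypothetical choices for the example. The sketch also assumes that a unique block is stored under the index in the S4.1 branch, since later blocks must be able to find it there.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    data: bytes
    fingerprint: Optional[bytes] = None   # filled in only when first needed

def fingerprint(entry):
    """Compute the fingerprint on first use and cache it on the entry."""
    if entry.fingerprint is None:
        entry.fingerprint = hashlib.sha256(entry.data).digest()
    return entry.fingerprint

def check_and_add(bucket, data):
    """Classify data against the blocks already under this index, then store it."""
    new = Entry(data)
    if not bucket:                                    # S4.1: nothing to compare with
        verdict = "unique"
    else:
        known = {fingerprint(e) for e in bucket}      # S4.2: fill in missing fingerprints
        verdict = "duplicate" if fingerprint(new) in known else "non-duplicate"  # S4.3
    bucket.append(new)                                # S4.4: record the block
    return verdict
```

A block that lands alone under its index is never hashed unless a later block arrives at the same position.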
Further, the data block fingerprint is a secure hash function value of the data block.
Further, the extracted byte values at selected positions are defined as the 1st byte value, the last byte value, the [L/2]-th byte value, and the 2^n-th byte value, where L is the data block length, n is a natural number, and 2^n ≤ L.
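The byte sampling can be sketched as follows. Several details are assumptions, because the text does not fix them: positions are treated as 1-based, [L/2] is read as integer (floor) division and clamped to at least 1 so that a one-byte block still has a valid middle position, and the caller chooses which exponents n to sample. The name `sample_bytes` is hypothetical.

```python
def sample_bytes(data, exponents=()):
    """Return the 1st, last, [L/2]-th and selected 2**n-th byte values of data."""
    length = len(data)
    if length == 0:
        raise ValueError("cannot sample an empty block")
    positions = [1, length, max(1, length // 2)]      # 1-based positions
    for n in exponents:
        if 2 ** n > length:
            raise ValueError("2**n must not exceed the block length")
        positions.append(2 ** n)
    return tuple(data[p - 1] for p in positions)      # convert to 0-based indexing

sampled = sample_bytes(bytes(range(10)), exponents=(3,))
```

For the ten-byte block `00 01 … 09` this samples positions 1, 10, 5, and 8.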
Compared with the prior art, the application has the following beneficial effects:
(1) The application introduces the block size as one of the index keys, so it can be applied to content-based chunking methods;
(2) The application builds the index from block content and calculates and compares fingerprints only when several blocks exist under the same index, which reduces the number of fingerprint calculations and improves retrieval efficiency.
Drawings
FIG. 1 is a flow chart of the present application;
FIG. 2 is a schematic diagram showing the specific steps of Example 2.
Detailed Description
The application is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.
Example 1
As shown in FIG. 1, a quick retrieval method for variable-length data blocks proceeds as follows:
(1) Input a group of blocks to be checked for duplicates;
(2) Input one block. If the current input is empty, terminate the retrieval; otherwise go to step (3);
(3) Extract the length L of the block, the first byte A, the last byte B, and the [L/2]-th byte C;
(4) Calculate K = min(L, 65536) mod 256;
(5) Taking K, A, B, and C in turn as nodes, construct the index tree S-K-A-B-C, where S is the root node of the index tree;
(6) If no other block exists under the current index, the currently detected block is a unique block; return to step (2). Otherwise proceed to the next step;
(7) Calculate the fingerprints of the data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
(8) Compare the fingerprint of the currently detected data block with the fingerprints of the other blocks under the index; if an identical fingerprint exists, the currently detected block is a duplicate block; otherwise it is a non-duplicate block. Return to step (2).
Example 2
As shown in FIG. 2, a quick retrieval method for variable-length data blocks proceeds as follows:
(1) Read the group of variable-length data blocks to be retrieved;
(2) Input data block 1 of the variable-length data block group;
(3) Extract the length L of data block 1 and the byte values 1D, 34, 50 at the selected positions;
(4) Map the length of data block 1 to the interval [0, 255] by calculating K = min(L, 65536) mod 256, where K is the value obtained by mapping the current data block length to the interval [0, 255], L is the data block length, min(L, 65536) takes the smaller of L and 65536 so that data blocks longer than 64 KB are assigned to the same child node, and mod is the modulo operation;
(5) Taking K, 1D, 34, and 50 in turn as keys, construct the index tree S-K-1D-34-50, where S is the root node of the index tree;
(6) No other data block exists under the current index, so data block 1 is a unique block;
(7) Input data block 2 of the variable-length data block group;
(8) Extract the length L of data block 2 and the byte values 1D, 34, 50 at the selected positions;
(9) Map the length of data block 2 to the interval [0, 255] by calculating K = min(L, 65536) mod 256;
(10) Taking K, 1D, 34, and 50 in turn as keys, construct the index tree S-K-1D-34-50;
(11) Another data block already exists under the current index, so data block 2 is not a unique block and fingerprints must be compared;
(12) Calculate the fingerprints of the data blocks under the index of data block 2 whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
(13) Compare the fingerprint of data block 2 with the fingerprints of the other data blocks under the index; if an identical fingerprint exists, the currently detected data block is a duplicate block; otherwise it is a non-duplicate block;
(14) Add the information of data block 2 as a child node of the current node;
(15) Input data block 3 of the variable-length data block group;
(16) Extract the length L of data block 3 and the byte values 82, 9D, 12 at the selected positions;
(17) Map the length of data block 3 to the interval [0, 255] by calculating K = min(L, 65536) mod 256;
(18) Taking K, 82, 9D, and 12 in turn as keys, construct the index tree S-K-82-9D-12;
(19) No other data block exists under the current index, so data block 3 is a unique block.
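The walk-through above can be reproduced end to end with the sketch below. It rests on several assumptions that the example does not state: the sampled bytes are read as hexadecimal and taken in the order first, last, [L/2]-th byte (as in Example 1); all three blocks are 300 bytes long; blocks 1 and 2 differ in their unsampled bytes; a flat dictionary keyed by the tuple (K, A, B, C) stands in for the tree path S-K-A-B-C; and SHA-256 is the fingerprint. The names `BlockIndex` and `make_block` are hypothetical.

```python
import hashlib

class BlockIndex:
    def __init__(self):
        self.buckets = {}          # (K, first, last, middle) -> list of [data, fingerprint]
        self.hashes_computed = 0

    def _fingerprint(self, entry):
        if entry[1] is None:       # hash lazily, at most once per block
            entry[1] = hashlib.sha256(entry[0]).digest()
            self.hashes_computed += 1
        return entry[1]

    def check(self, data):
        length = len(data)
        middle = data[max(1, length // 2) - 1]          # 1-based [L/2]-th byte
        key = (min(length, 65536) % 256, data[0], data[-1], middle)
        bucket = self.buckets.setdefault(key, [])
        entry = [data, None]
        if not bucket:
            verdict = "unique"
        else:
            known = {self._fingerprint(e) for e in bucket}
            verdict = "duplicate" if self._fingerprint(entry) in known else "non-duplicate"
        bucket.append(entry)
        return key, verdict

def make_block(first, last, middle, length=300, fill=0x00):
    """Build a test block with the given bytes at the sampled positions."""
    block = bytearray([fill]) * length
    block[0], block[-1], block[length // 2 - 1] = first, last, middle
    return bytes(block)

block1 = make_block(0x1D, 0x34, 0x50)
block2 = make_block(0x1D, 0x34, 0x50, fill=0xFF)   # same sampled bytes, different content
block3 = make_block(0x82, 0x9D, 0x12)

index = BlockIndex()
results = [index.check(b) for b in (block1, block2, block3)]
verdicts = [verdict for _, verdict in results]
```

Under these assumptions blocks 1 and 3 are classified without any hashing, and only the conflict between blocks 1 and 2 triggers fingerprint calculation, so two hashes are computed where the fingerprint-first approach would compute three.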
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present application, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and variations should also be regarded as being within the scope of the application.
Claims (2)
1. A quick retrieval method for variable-length data blocks, characterized by comprising the following steps:
Step S1: reading a group of variable-length data blocks to be retrieved;
Step S2: extracting the length of each data block in the data block group and the byte values at selected positions;
the extracted byte values at selected positions are defined as the 1st byte value, the last byte value, the [L/2]-th byte value, and the 2^n-th byte value, where L is the data block length, n is a natural number, and 2^n ≤ L;
step S2 specifically comprises:
Step S2.1: inputting one data block of the variable-length data block group;
Step S2.2: if the current input is empty, outputting the determination result and stopping the retrieval; otherwise extracting the length L of the current data block and the byte values A0, A1, A2, … at selected positions;
Step S3: mapping the length of the current data block to the interval [0, 255], and constructing the index tree with the mapped value as the first child node of the index tree;
step S3 specifically comprises:
Step S3.1: calculating K = min(L, 65536) mod 256, where K is the value obtained by mapping the current data block length to the interval [0, 255], L is the data block length, min(L, 65536) takes the smaller of L and 65536 so that data blocks longer than 64 KB are assigned to the same child node, and mod is the modulo operation;
Step S3.2: taking K, A0, A1, A2, … in turn as keys, constructing the index tree S-K-A0-A1-A2-…, where S is the root node of the index tree;
Step S4: calculating the data block fingerprint, comparing fingerprints, adding the data block information as a child node of the current node, and returning to step S1;
step S4 specifically comprises:
Step S4.1: if no other data block exists under the current index, the currently detected data block is a unique block, and the method returns to step S2; otherwise proceeding to step S4.2;
Step S4.2: calculating the fingerprints of those data blocks under the current index whose fingerprints have not yet been calculated, and the fingerprint of the currently detected data block;
Step S4.3: comparing the fingerprint of the currently detected data block with the fingerprints of the other data blocks under the index; if an identical fingerprint exists, the currently detected data block is a duplicate block; otherwise it is a non-duplicate block;
Step S4.4: adding the data block information as a child node of the current node, and returning to step S2.
2. The quick retrieval method for variable-length data blocks according to claim 1, wherein the data block fingerprint is a secure hash function value of the data block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110424974.2A CN113495901B (en) | 2021-04-20 | 2021-04-20 | Quick retrieval method for variable-length data blocks |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113495901A (en) | 2021-10-12 |
CN113495901B (en) | 2023-10-13 |
Family
ID=77997665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110424974.2A Active CN113495901B (en) | 2021-04-20 | 2021-04-20 | Quick retrieval method for variable-length data blocks |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113495901B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN113495901A (en) | 2021-10-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||