CN107704472B

CN107704472B - Method and device for searching data block

Info

Publication number: CN107704472B
Application number: CN201610648299.0A
Authority: CN
Inventors: 关坤; 冷继南; 沈建强; 王工艺
Original assignee: Huawei Technologies Co Ltd
Current assignee: Chengdu Huawei Technology Co Ltd
Priority date: 2016-08-09
Filing date: 2016-08-09
Publication date: 2020-07-24
Anticipated expiration: 2036-08-09
Also published as: CN107704472A

Abstract

The embodiment of the invention discloses a method and a device for searching a data block, relates to the field of data detection, and aims to solve the problem that a large amount of data redundancy exists in a multi-level searching structure in the prior art. The method comprises the following steps: acquiring a K-level characteristic fingerprint of a first data block; determining a target index according to the K-level characteristic fingerprint and a combined search structure, determining a target data block according to the target index, wherein the target data block is the second data block with the highest similarity level with the first data block in M second data blocks corresponding to the M indexes in the combined search structure, the similarity level of the target data block and the first data block is greater than 0, the combined search structure comprises M corresponding relations, one corresponding relation is the corresponding relation between the characteristic fingerprint of one second data block and the index of the second data block, the level number of the characteristic fingerprint of each second data block is greater than or equal to K, and the total number of the indexes in the combined search structure is the same as the total number of the second data blocks.

Description

Method and device for searching data block

Technical Field

The embodiment of the invention relates to the field of data detection, in particular to a method and a device for searching a data block.

Background

Data detection technology is widely applied in fields such as the Internet, image recognition, big data analysis, and data reduction, and searching for identical and/or similar data is an important part of it. At present, a search based on a single feature fingerprint can use mature methods such as search trees and hash tables. However, searching for identical or similar data by a single feature fingerprint does not necessarily improve the data compression rate, and in the field of data reduction, search scenarios based on multiple feature fingerprints often arise, for example deduplication and delta compression. A multi-level lookup structure therefore needs to be deployed to search for identical and similar data.

The current multi-level lookup structure is shown in FIG. 1. It is an N-level lookup structure (N is an integer greater than 1), in which each level is one lookup structure. Each lookup structure includes M feature fingerprints (M is an integer greater than 1) and M pointers in one-to-one correspondence with the M feature fingerprints. The M pointers are the pointers of M data blocks, and the pointer of a data block points to the address of that data block. In FIG. 1, Fpn-Bm denotes the n-th feature fingerprint (n is an integer greater than 0 and less than or equal to N) of the m-th data block (m is an integer greater than 0 and less than or equal to M) among the M data blocks, I-Bm denotes the pointer of the m-th data block, and Bm denotes the m-th data block. As FIG. 1 shows, because each of the N feature fingerprints of a data block corresponds to a pointer of that data block, every lookup structure must include M pointers, which results in a large amount of redundant data.

Disclosure of Invention

The embodiment of the invention provides a method and a device for searching a data block, which are used for solving the problem that a large amount of data redundancy exists in a multi-level searching structure in the prior art.

In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:

In a first aspect, a method for searching a data block is provided, including: acquiring K-level feature fingerprints of a first data block, wherein the level of the k-th level feature fingerprint among the K-level feature fingerprints is higher than the level of the (k-1)-th level feature fingerprint, K is an integer greater than 0, and k is an integer greater than 0 and less than or equal to K; determining a target index according to the K-level feature fingerprints and a joint search structure, and determining a target data block according to the target index, wherein the target data block is the second data block that has the highest similarity level with the first data block among the M second data blocks corresponding to the M indexes in the joint search structure, the similarity level between the target data block and the first data block is greater than 0, the joint search structure includes M correspondences, each correspondence is between the feature fingerprints of one second data block and the index of that second data block, the number of levels of the feature fingerprints of each second data block is greater than or equal to K, the total number of indexes in the joint search structure equals the total number of second data blocks, and M is an integer greater than 0; wherein the similarity level indicates the degree of correlation between the second data block and the first data block.

The method provided by the first aspect can search the same data or similar data through the joint search structure, and because the total number of indexes included in the joint search structure is the same as the total number of second data blocks, and all feature fingerprints of one second data block correspond to only one index of the second data block, compared with the scheme in the prior art, the redundancy of data can be greatly reduced.
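As an illustration only, the difference in pointer count can be sketched in Python. The `Correspondence` class and the two counting helpers are hypothetical names introduced for this sketch and do not appear in the embodiments; the sketch assumes every data block carries the same number of fingerprint levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """One entry of a joint search structure (hypothetical layout):
    all feature fingerprints of a block share a single index."""
    fingerprints: tuple  # level-1 ... level-N feature fingerprints
    index: int           # the one index (pointer) of the stored block


def pointers_multi_level(num_blocks: int, num_levels: int) -> int:
    # Prior-art layout: each of the N lookup structures keeps one pointer per block.
    return num_blocks * num_levels


def pointers_joint(num_blocks: int) -> int:
    # Joint layout: one index per block, regardless of the number of levels.
    return num_blocks


joint = [
    Correspondence(("a1", "b1", "c1"), 0),
    Correspondence(("a2", "b2", "c2"), 1),
]
```

Under these assumptions, M blocks with N fingerprint levels need N x M pointers in the multi-level layout and M indexes in the joint layout.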

With reference to the first aspect, in a first possible implementation manner, the feature fingerprints of the first data block and the second data block include similar feature fingerprints, or the feature fingerprints of the first data block and the second data block include similar feature fingerprints and the same feature fingerprint.

With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner, determining a target index according to the K-level feature fingerprints and the joint search structure includes: matching the K-level feature fingerprints with the feature fingerprints in a first correspondence in the joint search structure; if the matching succeeds, determining the index in the first correspondence as the target index; if the matching fails and the K-level feature fingerprints are larger than the feature fingerprints in the first correspondence, recording the matching level between the K-level feature fingerprints and the feature fingerprints in the first correspondence, taking the correspondence in the joint search structure that is adjacent to the first correspondence and whose feature fingerprints are larger than those in the first correspondence as a new first correspondence, and continuing to match the K-level feature fingerprints with the feature fingerprints in the new first correspondence; if the matching fails and the K-level feature fingerprints are smaller than the feature fingerprints in the first correspondence, recording the matching level between the K-level feature fingerprints and the feature fingerprints in the first correspondence, taking the correspondence in the joint search structure that is adjacent to the first correspondence and whose feature fingerprints are smaller than those in the first correspondence as a new first correspondence, and continuing to match the K-level feature fingerprints with the feature fingerprints in the new first correspondence; and so on until the target index is determined, wherein, if all first correspondences fail to match the K-level feature fingerprints, the target index is the index in the first correspondence that has the highest matching level with the K-level feature fingerprints among all the first correspondences, and the initial first correspondence is the 1st correspondence in the joint search structure. Matching the K-level feature fingerprints with the feature fingerprints in the first correspondence includes: starting from the level-1 feature fingerprint among the K-level feature fingerprints, matching each feature fingerprint in turn with the feature fingerprint of the same level in the first correspondence, wherein the matching succeeds if the K-th level feature fingerprint among the K-level feature fingerprints matches the K-th level feature fingerprint in the first correspondence, and fails otherwise.
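A minimal Python sketch of this second implementation manner follows. It assumes, for illustration, that the joint search structure is a binary search tree ordered by lexicographic comparison of fingerprint tuples, that the root is the 1st correspondence, and that the matching level counts consecutive matching levels starting from level 1. `Node`, `insert`, `matching_level`, and `find_target_index` are hypothetical names, not taken from the embodiments.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    fingerprints: Tuple[int, ...]   # level-1 ... level-N fingerprints of a stored block
    index: int                      # the single index of that block
    left: Optional["Node"] = None   # adjacent correspondence with smaller fingerprints
    right: Optional["Node"] = None  # adjacent correspondence with larger fingerprints


def insert(root: Optional[Node], fingerprints: Tuple[int, ...], index: int) -> Node:
    if root is None:
        return Node(fingerprints, index)
    if fingerprints < root.fingerprints:
        root.left = insert(root.left, fingerprints, index)
    else:
        root.right = insert(root.right, fingerprints, index)
    return root


def matching_level(query: Tuple[int, ...], stored: Tuple[int, ...]) -> int:
    # Assumption: count levels that match consecutively, starting at level 1.
    level = 0
    for q, s in zip(query, stored):
        if q != s:
            break
        level += 1
    return level


def find_target_index(root: Optional[Node], query: Tuple[int, ...]):
    """Return (target index or None, matching level) for K = len(query)."""
    best_index, best_level = None, 0
    node = root
    while node is not None:
        level = matching_level(query, node.fingerprints)
        if level == len(query):          # all K levels matched: matching succeeds
            return node.index, level
        if level > best_level:           # record the best partial match seen so far
            best_index, best_level = node.index, level
        # Move to the adjacent correspondence on the larger or smaller side.
        node = node.right if query > node.fingerprints else node.left
    return best_index, best_level
```

In this sketch only the correspondences on one root-to-leaf path are visited, and a result of `(None, 0)` means no stored block has a similarity level greater than 0.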

With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner, determining a target index according to the K-level feature fingerprints and the joint search structure includes: matching the a-th level feature fingerprint among the K-level feature fingerprints with the a-th level feature fingerprints in the M correspondences in the joint search structure, and determining all correspondences that match the a-th level feature fingerprint; setting a = a + 1, matching the a-th level feature fingerprint among the K-level feature fingerprints with the a-th level feature fingerprints in all of the correspondences so determined, and determining all correspondences that match it; and repeating until the index in the correspondence that has the highest matching level with the K-level feature fingerprints is determined as the target index, wherein the initial value of a is 1.
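The level-by-level narrowing of this third implementation manner can be sketched as follows. The sketch assumes the joint search structure is held as a plain list of `(fingerprints, index)` pairs and that ties between equally good candidates are broken by taking the first one; both choices, and the name `find_target_index_by_level`, are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Entry = Tuple[Tuple[int, ...], int]  # (feature fingerprints, index)


def find_target_index_by_level(
    correspondences: List[Entry], query: Tuple[int, ...]
) -> Tuple[Optional[int], int]:
    """Return (target index or None, matching level) for K = len(query)."""
    candidates = list(correspondences)
    level = 0
    for a in range(len(query)):            # a = 0 corresponds to level 1
        matched = [c for c in candidates if c[0][a] == query[a]]
        if not matched:                    # nothing matches at this level: stop
            break
        candidates = matched               # the next level only checks the survivors
        level = a + 1
    if level == 0:
        return None, 0                     # no block with similarity level > 0
    return candidates[0][1], level         # assumed tie-break: first survivor
```

Each pass examines only the correspondences that survived the previous level, so the candidate set shrinks as a increases.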

By adopting the second possible implementation mode and the third possible implementation mode to search the data blocks, the K-level feature fingerprints of the first data blocks do not need to be matched with the feature fingerprints of all the second data blocks, and the searching efficiency can be improved.

With reference to the first aspect and any one of the first possible implementation manner to the third possible implementation manner of the first aspect, in a fourth possible implementation manner, the method further includes: if the target data block is the same as the first data block, storing indication information, wherein the indication information is used for indicating the address of the target data block; if the target data block is a data block similar to the first data block, performing similar compression on the first data block based on the target data block, and storing compressed data; and if the target data block is not determined, storing the first data block.

In the fourth possible implementation manner, once the target data block is determined: if the target data block is the same as the first data block, data deduplication can be achieved by storing the indication information. Data deduplication is a lossless technique for reducing redundant data: of several identical data blocks, only one copy is stored in the storage system, which reduces the resources required for data storage and saves cost. If the target data block is similar to the first data block, the first data block is compressed against the target data block (similar compression), which reduces the amount of stored data, improves the data compression rate, and saves storage space.
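The three storage outcomes can be sketched in Python as follows. The embodiments do not name a compression algorithm; here the preset-dictionary feature of Python's `zlib` module is used only as a stand-in for similar (delta) compression, and identity is checked by comparing bytes rather than fingerprints. The record layout and the names `store_block` and `load_block` are hypothetical.

```python
import zlib
from typing import Optional, Tuple


def store_block(first: bytes, target: Optional[bytes],
                target_address: Optional[int]) -> Tuple[str, object]:
    """Decide what to store for `first` given the target block found by the search."""
    if target is None:
        return ("raw", first)               # no target determined: store the block itself
    if target == first:
        return ("ref", target_address)      # identical: store only indication information
    # Similar: compress `first` against `target` (zlib dictionary as a stand-in).
    comp = zlib.compressobj(zdict=target)
    return ("delta", comp.compress(first) + comp.flush())


def load_block(record: Tuple[str, object], target: Optional[bytes]) -> bytes:
    """Rebuild the original block from a stored record and its target block."""
    kind, payload = record
    if kind == "raw":
        return payload
    if kind == "ref":
        return target
    decomp = zlib.decompressobj(zdict=target)
    return decomp.decompress(payload) + decomp.flush()
```

A "delta" record can only be decoded with the same target block, so in this sketch the target must remain stored for as long as any record depends on it.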

With reference to the first aspect and any one of the first possible implementation manner to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner, the method further includes: and calculating the characteristic fingerprint of the first data block, and adding the corresponding relation between the characteristic fingerprint of the first data block and the index of the first data block into the joint search structure.
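Adding a new correspondence can be sketched as follows, assuming the joint search structure is kept as a list sorted by fingerprint tuple. The fingerprint function is passed in as a parameter because the embodiments do not fix one here; `toy_fingerprints` is a placeholder for demonstration and is not a similarity-preserving fingerprint. All names are hypothetical.

```python
import bisect
from typing import Callable, List, Tuple

Entry = Tuple[Tuple[int, ...], int]  # (feature fingerprints, index)


def add_block(structure: List[Entry], block: bytes, index: int,
              compute_fingerprints: Callable[[bytes], Tuple[int, ...]]) -> None:
    """Compute the block's fingerprints and insert one correspondence, keeping order."""
    fingerprints = compute_fingerprints(block)
    # One entry per block: every fingerprint level shares the same single index.
    bisect.insort(structure, (fingerprints, index))


def toy_fingerprints(block: bytes) -> Tuple[int, ...]:
    # Placeholder only, NOT a real feature fingerprint algorithm.
    return (len(block), sum(block) % 251)
```

Because exactly one entry is added per block, the number of indexes in the structure stays equal to the number of stored blocks.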

In a second aspect, an apparatus for searching a data block is provided, including: an acquisition unit, configured to acquire K-level feature fingerprints of a first data block, wherein the level of the k-th level feature fingerprint among the K-level feature fingerprints is higher than the level of the (k-1)-th level feature fingerprint, K is an integer greater than 0, and k is an integer greater than 0 and less than or equal to K; and a determining unit, configured to determine a target index according to the K-level feature fingerprints and a joint search structure, and to determine a target data block according to the target index, wherein the target data block is the second data block that has the highest similarity level with the first data block among the M second data blocks corresponding to the M indexes included in the joint search structure, the similarity level between the target data block and the first data block is greater than 0, the joint search structure includes M correspondences, each correspondence is between the feature fingerprints of one second data block and the index of that second data block, the number of levels of the feature fingerprints of each second data block is greater than or equal to K, the total number of indexes included in the joint search structure equals the total number of second data blocks, and M is an integer greater than 0; wherein the similarity level indicates the degree of correlation between the second data block and the first data block.

The units in the apparatus provided by the second aspect are configured to perform the method provided by the first aspect, and therefore, the beneficial effects of the apparatus can be referred to the beneficial effects of the above method part, and are not described herein again.

With reference to the second aspect, in a first possible implementation manner, the feature fingerprints of the first data block and the second data block include similar feature fingerprints, or the feature fingerprints of the first data block and the second data block include similar feature fingerprints and the same feature fingerprint.

With reference to the first possible implementation manner of the second aspect, in a second possible implementation manner, the determining unit is specifically configured to: match the K-level feature fingerprints with the feature fingerprints in a first correspondence in the joint search structure; if the matching succeeds, determine the index in the first correspondence as the target index; if the matching fails and the K-level feature fingerprints are larger than the feature fingerprints in the first correspondence, record the matching level between the K-level feature fingerprints and the feature fingerprints in the first correspondence, take the correspondence in the joint search structure that is adjacent to the first correspondence and whose feature fingerprints are larger than those in the first correspondence as a new first correspondence, and continue to match the K-level feature fingerprints with the feature fingerprints in the new first correspondence; if the matching fails and the K-level feature fingerprints are smaller than the feature fingerprints in the first correspondence, record the matching level between the K-level feature fingerprints and the feature fingerprints in the first correspondence, take the correspondence in the joint search structure that is adjacent to the first correspondence and whose feature fingerprints are smaller than those in the first correspondence as a new first correspondence, and continue to match the K-level feature fingerprints with the feature fingerprints in the new first correspondence; and so on until the target index is determined, wherein, if all first correspondences fail to match the K-level feature fingerprints, the target index is the index in the first correspondence that has the highest matching level with the K-level feature fingerprints among all the first correspondences, and the initial first correspondence is the 1st correspondence in the joint search structure. Matching the K-level feature fingerprints with the feature fingerprints in the first correspondence includes: starting from the level-1 feature fingerprint among the K-level feature fingerprints, matching each feature fingerprint in turn with the feature fingerprint of the same level in the first correspondence, wherein the matching succeeds if the K-th level feature fingerprint among the K-level feature fingerprints matches the K-th level feature fingerprint in the first correspondence, and fails otherwise.

With reference to the first possible implementation manner of the second aspect, in a third possible implementation manner, the determining unit is specifically configured to: match the a-th level feature fingerprint among the K-level feature fingerprints with the a-th level feature fingerprints in the M correspondences in the joint search structure, and determine all correspondences that match the a-th level feature fingerprint; set a = a + 1, match the a-th level feature fingerprint among the K-level feature fingerprints with the a-th level feature fingerprints in all of the correspondences so determined, and determine all correspondences that match it; and repeat until the index in the correspondence that has the highest matching level with the K-level feature fingerprints is determined as the target index, wherein the initial value of a is 1.

With reference to the second aspect, or any one of the first possible implementation manner to the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the apparatus further includes a compressed storage unit; the compressed storage unit is used for storing indication information when the target data block is the same as the first data block, and the indication information is used for indicating the address of the target data block; or, when the target data block is a data block similar to the first data block, performing similar compression on the first data block based on the target data block, and storing the compressed data; or, for storing the first data block when the target data block is not determined.

With reference to the second aspect and any one of the first possible implementation manner to the fourth possible implementation manner of the second aspect, in a fifth possible implementation manner, the apparatus further includes an execution unit; and the execution unit is used for calculating the characteristic fingerprint of the first data block and adding the corresponding relation between the characteristic fingerprint of the first data block and the index of the first data block into the joint search structure.

In a third aspect, an apparatus for searching a data block is provided, including a memory and a processor, wherein the memory is configured to store a set of code and the processor is configured to perform the following actions according to the code: acquiring K-level feature fingerprints of a first data block, wherein the level of the k-th level feature fingerprint among the K-level feature fingerprints is higher than the level of the (k-1)-th level feature fingerprint, K is an integer greater than 0, and k is an integer greater than 0 and less than or equal to K; determining a target index according to the K-level feature fingerprints and a joint search structure, and determining a target data block according to the target index, wherein the target data block is the second data block that has the highest similarity level with the first data block among the M second data blocks corresponding to the M indexes in the joint search structure, the similarity level between the target data block and the first data block is greater than 0, the joint search structure includes M correspondences, each correspondence is between the feature fingerprints of one second data block and the index of that second data block, the number of levels of the feature fingerprints of each second data block is greater than or equal to K, the total number of indexes in the joint search structure equals the total number of second data blocks, and M is an integer greater than 0; wherein the similarity level indicates the degree of correlation between the second data block and the first data block.

The devices in the apparatus provided by the third aspect are used for performing the method provided by the first aspect, and therefore, the beneficial effects of the apparatus can be referred to the beneficial effects of the above method part, and are not described herein again.

With reference to the third aspect, in a first possible implementation manner, the feature fingerprints of the first data block and the second data block include similar feature fingerprints, or the feature fingerprints of the first data block and the second data block include similar feature fingerprints and the same feature fingerprint.

With reference to the first possible implementation manner of the third aspect, in a second possible implementation manner, the processor is specifically configured to: match the K-level feature fingerprints with the feature fingerprints in a first correspondence in the joint search structure; if the matching succeeds, determine the index in the first correspondence as the target index; if the matching fails and the K-level feature fingerprints are larger than the feature fingerprints in the first correspondence, record the matching level between the K-level feature fingerprints and the feature fingerprints in the first correspondence, take the correspondence in the joint search structure that is adjacent to the first correspondence and whose feature fingerprints are larger than those in the first correspondence as a new first correspondence, and continue to match the K-level feature fingerprints with the feature fingerprints in the new first correspondence; if the matching fails and the K-level feature fingerprints are smaller than the feature fingerprints in the first correspondence, record the matching level between the K-level feature fingerprints and the feature fingerprints in the first correspondence, take the correspondence in the joint search structure that is adjacent to the first correspondence and whose feature fingerprints are smaller than those in the first correspondence as a new first correspondence, and continue to match the K-level feature fingerprints with the feature fingerprints in the new first correspondence; and so on until the target index is determined, wherein, if all first correspondences fail to match the K-level feature fingerprints, the target index is the index in the first correspondence that has the highest matching level with the K-level feature fingerprints among all the first correspondences, and the initial first correspondence is the 1st correspondence in the joint search structure. Matching the K-level feature fingerprints with the feature fingerprints in the first correspondence includes: starting from the level-1 feature fingerprint among the K-level feature fingerprints, matching each feature fingerprint in turn with the feature fingerprint of the same level in the first correspondence, wherein the matching succeeds if the K-th level feature fingerprint among the K-level feature fingerprints matches the K-th level feature fingerprint in the first correspondence, and fails otherwise.

With reference to the first possible implementation manner of the third aspect, in a third possible implementation manner, the processor is specifically configured to: match the a-th level feature fingerprint among the K-level feature fingerprints with the a-th level feature fingerprints in the M correspondences in the joint search structure, and determine all correspondences that match the a-th level feature fingerprint; set a = a + 1, match the a-th level feature fingerprint among the K-level feature fingerprints with the a-th level feature fingerprints in all of the correspondences so determined, and determine all correspondences that match it; and repeat until the index in the correspondence that has the highest matching level with the K-level feature fingerprints is determined as the target index, wherein the initial value of a is 1.

With reference to the third aspect and any one of the first possible implementation manner to the third possible implementation manner of the third aspect, in a fourth possible implementation manner, the processor is further configured to: when the target data block is the same as the first data block, storing indication information, wherein the indication information is used for indicating the address of the target data block; when the target data block is a data block similar to the first data block, performing similar compression on the first data block based on the target data block, and storing compressed data; when the target data block is not determined, the first data block is stored.

With reference to the third aspect and any one of the first possible implementation manner to the fourth possible implementation manner of the third aspect, in a fifth possible implementation manner, the processor is further configured to: and calculating the characteristic fingerprint of the first data block, and adding the corresponding relation between the characteristic fingerprint of the first data block and the index of the first data block into the joint search structure.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram of a multi-level lookup structure according to the prior art;

FIG. 2 is a block diagram of a hardware architecture of a computer according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for searching a data block according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating a joint search structure according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating a tree-based joint search structure according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a continuous joint search structure according to an embodiment of the present invention;

FIG. 7 is a schematic diagram illustrating the composition of an apparatus for searching a data block according to an embodiment of the present invention;

FIG. 8 is a schematic diagram illustrating a structure of another apparatus for searching data blocks according to an embodiment of the present invention;

FIG. 9 is a schematic composition diagram of another apparatus for searching for a data block according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The method provided by the embodiments of the present invention can be applied to the field of data storage, in particular to compressed-storage scenarios that require multi-level similar-data search, such as primary storage or backup, and can also be applied to image compression, version control, or other application scenarios in which direct compression performs poorly.

The device for executing the method provided by the embodiment of the present invention may be a system or a device for searching data, and specifically may be a computer, and the hardware architecture composition of the computer may refer to fig. 2, including: input device, output device, memory, and Central Processing Unit (CPU).

The CPU is the core component of the computer system and consists of an arithmetic unit and a controller. The arithmetic unit mainly processes data; the controller parses instructions and, as each instruction requires, sends control signals to the components of the system in an orderly and purposeful way (see the thin solid arrows in FIG. 2) so that the whole system works in coordination. The memory receives and stores the data and programs in the computer and reads them out on command; the data flow among the devices is indicated by the thick arrows in FIG. 2. Depending on its proximity to the CPU, memory is divided into internal memory and external memory.

The input equipment is used for inputting data and programs to the computer, is a bridge for human-computer interaction between a user and the computer, and mainly comprises: a mouse, a keyboard, a camera, a scanner, a light pen, a voice input device, and the like, and a computer mainly acquires raw data and a program for processing the raw data through an input device. The output device mainly outputs data in the system, and common output devices include: display, printer, plotter, video output system, and voice output system.

In order to make the description of the embodiments of the present invention clearer, some terms appearing in the embodiments of the present invention are specifically explained as follows:

Data block: a data block is a group or groups of records that are arranged consecutively together in sequence.

Feature fingerprint of a data block: a parameter that is obtained by applying a preset algorithm (for example, a Locality-Sensitive Hashing (LSH) algorithm) to the records contained in the data block, and that is used to determine data blocks whose correlation degree with the data block is not 0.

Same characteristic fingerprint of a data block: a characteristic fingerprint used to find a data block that is the same as the data block (i.e. the correlation degree is 1); the same characteristic fingerprints of any two different data blocks are different.

Similar characteristic fingerprint of a data block: a characteristic fingerprint used to find data blocks whose correlation degree with the data block is greater than 0 and less than 1.

Level of a characteristic fingerprint of a data block: the characteristic fingerprints of a data block are graded into levels, and a characteristic fingerprint with a higher level is used to determine data blocks with a higher correlation degree with the data block.

Joint search structure: a search structure, built from the characteristic fingerprints of the stored data blocks and the indexes of those data blocks, that is used to determine, among the stored data blocks, the data blocks whose correlation degree with a data block to be stored is not 0.

Similarity grade: a parameter for assessing the degree of correlation between two data blocks, the higher the similarity level, the higher the degree of correlation between the two data blocks.

Matching level: the number of levels to which the characteristic fingerprint of one data block can be matched with the characteristic fingerprint of another data block is the similarity level of the two data blocks.

An embodiment of the present invention provides a method for searching a data block, as shown in fig. 3, including:

301. Acquiring K-level feature fingerprints of a first data block, wherein the level of the k-th level feature fingerprint in the K-level feature fingerprints is higher than the level of the (k-1)-th level feature fingerprint in the K-level feature fingerprints, K is an integer greater than 0, and k is an integer greater than 0 and less than or equal to K.

Optionally, before step 301, the method further includes: acquiring a first data block. In this case, step 301 may be implemented as follows: the K-level feature fingerprints of the first data block are computed. The first data block may be a data block uploaded by a user, or a data block received from another device. The value of K may be a default or may be indicated by the device that sent the first data block.

Specifically, the first data block may be a data block to be stored, and the characteristic fingerprint of the first data block may be calculated by using an LSH algorithm.

302. Determining a target index according to the K-level feature fingerprint and a joint search structure, determining a target data block according to the target index, wherein the target data block is one of M second data blocks corresponding to M indexes in the joint search structure and has the highest similarity level with the first data block, the similarity level of the target data block and the first data block is greater than 0, the joint search structure comprises M corresponding relations, one corresponding relation is the corresponding relation between the feature fingerprint of one second data block and the index of the second data block, the level number of the feature fingerprint of each second data block is greater than or equal to K, the total number of the indexes in the joint search structure is the same as the total number of the second data blocks, and M is an integer greater than 0; wherein the similarity level is used for indicating the correlation degree of the second data block and the first data block.

The similarity level of the first data block and a second data block may be determined as follows: when the k'-th level feature fingerprint of a second data block is the same as the k'-th level feature fingerprint of the first data block, and the (k'+1)-th level feature fingerprint of the second data block is different from the (k'+1)-th level feature fingerprint of the first data block, the similarity level of the second data block and the first data block is k'. The larger the value of k', the higher the correlation degree of the second data block and the first data block; k' is an integer greater than or equal to 0 and less than K.

Illustratively, the similarity level of the second data block and the first data block is 1 when the 1st-level feature fingerprints of the two data blocks are the same and their 2nd-level feature fingerprints are different, and the similarity level is 4 when the 1st-level to 4th-level feature fingerprints of the two data blocks are the same and their 5th-level feature fingerprints are different.
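The rule above can be sketched as counting the leading levels on which two fingerprint lists agree. This is only an illustrative sketch: the function name `similarity_level` and the use of small integers as fingerprint values are assumptions, not part of the embodiment.

```python
from typing import Sequence


def similarity_level(first: Sequence[int], second: Sequence[int]) -> int:
    """Count how many leading levels of the two fingerprint lists are equal.

    `first` holds the K-level fingerprints of the first data block and
    `second` holds the fingerprints (K levels or more) of a second data block.
    """
    level = 0
    for fp_first, fp_second in zip(first, second):
        if fp_first != fp_second:
            break  # the first differing level ends the common prefix
        level += 1
    return level


# Levels 1 to 4 agree and level 5 differs, so the similarity level is 4.
print(similarity_level([7, 3, 9, 2, 5], [7, 3, 9, 2, 8]))
```

Because a higher-level match implies a match at every lower level, the count of the common prefix is enough; no level after the first mismatch needs to be inspected.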

Specifically, the index of the second data block is used to uniquely determine the second data block, and the index of the second data block may be a pointer of the second data block, and the pointer may point to an address of the second data block.

To make the invention easier to understand, the levels of feature fingerprints are explained using a person's place of residence as an example. A person's place of residence can generally be located precisely by stating, in order, the country, province, city, district, and street where the person lives. The country, province, city, district, and street can each be regarded as a feature fingerprint of the person, and the levels of these feature fingerprints increase in that order. It can be understood that when the (k'+1)-th level feature fingerprints of two data blocks are the same, the k'-th level feature fingerprints of the two data blocks are necessarily the same, just as two people living in the same city must live in the same province.

Optionally, the characteristic fingerprints of the first data block and the second data block include similar characteristic fingerprints, or the characteristic fingerprints of the first data block and the second data block include similar characteristic fingerprints and the same characteristic fingerprint. When the feature fingerprints of the first data block and the second data block include both similar feature fingerprints and identical feature fingerprints, the method provided by the embodiment of the present invention may search for the second data block similar to the first data block, or may search for the second data block identical to the first data block.

For example, as shown in fig. 4, fig. 4 shows a joint lookup structure and a corresponding relationship between indexes in the joint lookup structure and second data blocks, where the joint lookup structure includes M corresponding relationships, and if M second data blocks all have N-level feature fingerprints, the mth corresponding relationship is a corresponding relationship between the N-level feature fingerprint { Fp1-Bm, Fp2-Bm, …, FpN-Bm } of the mth second data block Bm and the indexes I-Bm of the mth second data block Bm, a total number of indexes included in the joint lookup structure is the same as a total number of the second data blocks, and all feature fingerprints of one second data block correspond to only one index of the second data block.

With the joint search structure provided by the embodiment of the invention, if a certain second data block is updated, only the data in the correspondence containing the index of that second data block needs to be updated in the joint search structure, so maintenance is simple and convenient. In the multi-level search structure of the prior art, once data is updated, all the search structures need to be updated, and maintenance is inconvenient. It should be noted that the number of levels of the feature fingerprints of the M second data blocks is generally the same, but may also be different; for example, when M is 2, the feature fingerprints of the 1st second data block may have 3 levels, and the feature fingerprints of the 2nd second data block may have 5 levels. It should also be noted that the same method is used to determine the feature fingerprint of a given level for the first data block and for each of the second data blocks.
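As a minimal sketch of this layout, the joint search structure can be pictured as a list holding exactly one correspondence per second data block. The names `Correspondence`, `build_joint_structure` and `update_block`, and the use of integers for fingerprints and indexes, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Correspondence:
    fingerprints: Tuple[int, ...]  # level-1 .. level-N fingerprints of one block
    index: int                     # value that uniquely identifies the block


def build_joint_structure(blocks: Dict[int, Tuple[int, ...]]) -> List[Correspondence]:
    """Create exactly one correspondence per stored block (index -> fingerprints)."""
    return [Correspondence(fps, idx) for idx, fps in blocks.items()]


def update_block(structure: List[Correspondence], index: int,
                 new_fingerprints: Tuple[int, ...]) -> bool:
    """Update the single correspondence that holds `index`; return False if absent."""
    for entry in structure:
        if entry.index == index:
            entry.fingerprints = new_fingerprints
            return True
    return False


structure = build_joint_structure({100: (1, 5), 200: (4, 3)})
update_block(structure, 200, (4, 9))
print(len(structure), structure[1])
```

Since every fingerprint of a block hangs off the same entry, an update touches one correspondence, and the number of indexes always equals the number of stored blocks.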

Optionally, when the step 302 is implemented specifically, any one of the following manners may be adopted.

The first method is as follows: matching the K-level feature fingerprints with feature fingerprints in a first corresponding relation in the combined search structure;

if the matching is successful, determining the index in the first corresponding relation as a target index;

if the matching fails and the K-level feature fingerprints are larger than the feature fingerprints in the first correspondence, recording the matching level between the K-level feature fingerprints and the feature fingerprints in the first correspondence, determining the correspondence in the joint search structure that is adjacent to the first correspondence and whose feature fingerprints are larger than those in the first correspondence as the new first correspondence, and continuing to match the K-level feature fingerprints with the feature fingerprints in the new first correspondence; if the matching fails and the K-level feature fingerprints are smaller than the feature fingerprints in the first correspondence, recording the matching level between the K-level feature fingerprints and the feature fingerprints in the first correspondence, determining the correspondence in the joint search structure that is adjacent to the first correspondence and whose feature fingerprints are smaller than those in the first correspondence as the new first correspondence, and continuing to match the K-level feature fingerprints with the feature fingerprints in the new first correspondence; matching continues in this way until a target index is determined; if all the first correspondences fail to match the K-level feature fingerprints, the target index is the index in the first correspondence that has the highest matching level with the K-level feature fingerprints among all the first correspondences; the initial first correspondence is the 1st correspondence in the joint search structure;

matching the K-level feature fingerprints with the feature fingerprints in a first correspondence in the joint search structure comprises: starting from the 1st-level feature fingerprint in the K-level feature fingerprints, matching each feature fingerprint in turn with the feature fingerprint of the same level in the first correspondence, wherein the matching is successful if the K-th level feature fingerprint in the K-level feature fingerprints is successfully matched with the K-th level feature fingerprint in the first correspondence, and the matching fails otherwise.

In the process, the two characteristic fingerprints are matched, namely whether the two characteristic fingerprints are the same or not is judged, if yes, the two characteristic fingerprints are successfully matched, and otherwise, the matching fails.

When the k'-th level feature fingerprint in the first correspondence is successfully matched with the k'-th level feature fingerprint in the K-level feature fingerprints, and the (k'+1)-th level feature fingerprint in the first correspondence fails to match the (k'+1)-th level feature fingerprint in the K-level feature fingerprints, the matching level between the first correspondence and the K-level feature fingerprints is k'. The matching level between the first correspondence and the K-level feature fingerprints is therefore the similarity level between the first data block and the second data block corresponding to the index in the first correspondence.

The K-level feature fingerprints are larger than the feature fingerprints in the first correspondence when the first x feature fingerprints in the K-level feature fingerprints are the same as the first x feature fingerprints in the first correspondence and the (x+1)-th feature fingerprint in the K-level feature fingerprints is larger than the (x+1)-th feature fingerprint in the first correspondence. The K-level feature fingerprints are smaller than the feature fingerprints in the first correspondence when the first y feature fingerprints in the K-level feature fingerprints are the same as the first y feature fingerprints in the first correspondence and the (y+1)-th feature fingerprint in the K-level feature fingerprints is smaller than the (y+1)-th feature fingerprint in the first correspondence. Here x and y are integers greater than or equal to 0. In the embodiment of the invention, the size of a feature fingerprint is a value obtained by applying a preset algorithm to the feature fingerprints of the same level; it is used only for comparing fingerprints of the same level and has no other meaning.
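The comparison rule described above is a level-by-level (lexicographic) comparison that also yields the matching level as a by-product. The sketch below assumes fingerprint sizes are plain integers; the function name `compare_fingerprints` and the -1/0/1 return convention are hypothetical.

```python
from typing import Sequence, Tuple


def compare_fingerprints(query: Sequence[int], stored: Sequence[int]) -> Tuple[int, int]:
    """Compare K-level fingerprints with the fingerprints of one correspondence.

    Returns (order, matching_level), where order is 1 if `query` is larger,
    -1 if it is smaller, and 0 if all K levels match.
    """
    level = 0
    for q, s in zip(query, stored):
        if q != s:
            # The first differing level decides the ordering.
            return (1 if q > s else -1), level
        level += 1
    return 0, level


print(compare_fingerprints((4, 3), (2, 5)))  # level 1 differs, query is larger
print(compare_fingerprints((4, 3), (4, 4)))  # level 1 matches, level 2 is smaller
```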

Specifically, in the first execution mode, in order to search the data block more conveniently, the design and implementation of the joint search structure may be performed based on the tree structure.

Exemplarily, taking K as 2 and the second data block has a 2-level feature fingerprint as an example, assuming that there are 7 second data blocks, the joint search structure includes 7 correspondences, and the 2-level feature fingerprints in the 7 correspondences are [ L2, H5], [ L1, H2], [ L4, H4], [ L5, H6], [ L4, H3], [ L2, H7] and [ L4, H1], respectively, where L denotes similar fingerprints, H denotes identical fingerprints, and numbers after L and H denote sizes of the fingerprints, and the tree-type joint search structure is as shown in fig. 5 (the correspondence between the feature fingerprints and the index is not shown).

If the K-level feature fingerprints of the first data block are [L4, H3], the process of searching for the data block in the first manner is as follows. The 1st-level feature fingerprint L4 of the first data block is matched with the 1st-level feature fingerprint L2 of the 1st of the 7 correspondences; the matching fails, L4 is larger than L2, and the matching level is 0. The feature fingerprints in the new first correspondence are determined to be [L4, H4]. The 1st-level feature fingerprint L4 of the first data block is matched with L4 of [L4, H4], and the matching succeeds; the 2nd-level feature fingerprint H3 of the first data block is matched with H4 of [L4, H4], the matching fails, H3 is smaller than H4, and the matching level is 1. The feature fingerprints in the new first correspondence are determined to be [L4, H3]. The feature fingerprints of the first data block are matched with [L4, H3], the matching succeeds, and the index in the correspondence containing [L4, H3] is determined as the target index. The specific matching flow is indicated by the arrows in fig. 5.

Based on the example described in fig. 5, if the K-level feature fingerprints of the first data block are [L4, H1], the process of searching for the data block in the first manner is as follows. The 1st-level feature fingerprint L4 of the first data block is matched with the 1st-level feature fingerprint L2 of the 1st of the 7 correspondences; the matching fails, L4 is larger than L2, and the matching level is 0. The feature fingerprints in the new first correspondence are determined to be [L4, H4]. The 1st-level feature fingerprint L4 of the first data block is matched with L4 of [L4, H4], and the matching succeeds; the 2nd-level feature fingerprint H1 of the first data block is matched with H4 of [L4, H4], the matching fails, H1 is smaller than H4, and the matching level is 1. Matching then continues in the same way with each new first correspondence, recording the matching level at each step, until the feature fingerprints of the first data block are successfully matched with [L4, H1], and the index in the correspondence containing [L4, H1] is determined as the target index. The specific matching flow is indicated by the arrows in fig. 5.
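The first manner can be sketched with a binary search tree over the 7 example correspondences. Fig. 5 is not reproduced here, so the tree shape is an assumption: the sketch simply inserts the correspondences in the order listed, which makes [L2, H5] the 1st correspondence. Fingerprints are written as integer pairs (L, H), block indexes 1 to 7 follow the listing order, and the names `Node`, `insert`, `matching_level` and `search` are hypothetical.

```python
class Node:
    """One correspondence: the fingerprints of a second data block and its index."""

    def __init__(self, fingerprints, index):
        self.fingerprints = fingerprints
        self.index = index
        self.smaller = None  # adjacent correspondence with smaller fingerprints
        self.larger = None   # adjacent correspondence with larger fingerprints


def insert(root, fingerprints, index):
    """Insert a correspondence, ordering nodes by level-by-level comparison."""
    if root is None:
        return Node(fingerprints, index)
    if fingerprints < root.fingerprints:
        root.smaller = insert(root.smaller, fingerprints, index)
    else:
        root.larger = insert(root.larger, fingerprints, index)
    return root


def matching_level(query, stored):
    """Number of leading levels on which query and stored fingerprints agree."""
    level = 0
    for q, s in zip(query, stored):
        if q != s:
            break
        level += 1
    return level


def search(root, query):
    """Return (target_index, matching_level); (None, 0) if nothing matches."""
    best_index, best_level = None, 0
    node = root
    while node is not None:
        level = matching_level(query, node.fingerprints)
        if level == len(query):
            return node.index, level  # all K levels matched
        if level > best_level:
            best_index, best_level = node.index, level  # record the best so far
        # Move to the adjacent larger or smaller correspondence.
        node = node.larger if query > node.fingerprints[:len(query)] else node.smaller
    return best_index, best_level


root = None
example = [(2, 5), (1, 2), (4, 4), (5, 6), (4, 3), (2, 7), (4, 1)]
for block_index, fps in enumerate(example, start=1):
    root = insert(root, fps, block_index)

print(search(root, (4, 3)))  # exact match on the 5th correspondence
print(search(root, (4, 2)))  # no exact match; best matching level is 1
```

Only the correspondences on one root-to-leaf path are visited, which is why the K-level fingerprints need not be matched against every second data block.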

The second method is as follows: matching the a-th level feature fingerprint in the K-level feature fingerprints with the a-th level feature fingerprints in the M correspondences in the joint search structure, and determining all correspondences that match the a-th level feature fingerprint in the K-level feature fingerprints; setting a to a + 1, matching the a-th level feature fingerprint in the K-level feature fingerprints with the a-th level feature fingerprints in all the correspondences just determined, and determining all correspondences that match the a-th level feature fingerprint in the K-level feature fingerprints; and repeating this step until the index in the correspondence having the highest matching level with the K-level feature fingerprints is determined as the target index, wherein the initial value of a is 1.

Specifically, in the second implementation manner, in order to search the data block more conveniently, the design and implementation of the joint search structure may be performed based on a continuous structure such as an array or a linked list.

Exemplarily, the 3-level feature fingerprints of the 1st to 10th data blocks are [L1, S1, H2], [L2, S2, H7], [L3, S4, H4], [L2, S2, H21], [L3, S4, H6], [L1, S3, H5], [L1, S3, H18], [L2, S2, H1], [L1, S1, H3] and [L3, S4, H8], respectively. The feature fingerprints of these 10 data blocks can then be stored as a continuous joint search structure that includes 10 correspondences (the indexes in the correspondences are not shown), specifically as shown in fig. 6, in which correspondences having the same 1st-level feature fingerprint are stored contiguously and, among them, correspondences having the same 2nd-level feature fingerprint are also stored contiguously.

In the second manner, if the first data block has 3-level feature fingerprints, specifically [L2, S2, H1], the 1st-level feature fingerprint L2 in the 3-level feature fingerprints of the first data block is matched with the 1st-level feature fingerprints of the 10 correspondences, and the feature fingerprints in all correspondences matching L2 are determined to be [L2, S2, H7], [L2, S2, H21] and [L2, S2, H1]. The 2nd-level feature fingerprint S2 of the first data block is then matched with the 2nd-level feature fingerprints of [L2, S2, H7], [L2, S2, H21] and [L2, S2, H1], and the feature fingerprints in all correspondences matching S2 are determined to be [L2, S2, H7], [L2, S2, H21] and [L2, S2, H1]. Finally, the 3rd-level feature fingerprint H1 of the first data block is matched with the 3rd-level feature fingerprints of these correspondences, the feature fingerprints matching H1 are determined to be [L2, S2, H1], and the index in the correspondence containing [L2, S2, H1] is determined as the target index, from which the target data block is determined.
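The second manner can be sketched as a level-by-level narrowing of the candidate correspondences held in a sorted array. The data are the 10 example fingerprints written as integer triples (L, S, H), and the block numbers 1 to 10 stand in for the indexes, which fig. 6 does not show. The name `search_by_level` and the choice to return the first remaining candidate when the match is only partial are assumptions.

```python
def search_by_level(structure, query):
    """Narrow the candidate correspondences one fingerprint level at a time.

    `structure` is a list of (fingerprints, index) pairs.
    Returns (target_index, matching_level), or (None, 0) if level 1 never matches.
    """
    candidates = structure
    level = 0
    for a, fingerprint in enumerate(query):
        matched = [entry for entry in candidates if entry[0][a] == fingerprint]
        if not matched:
            break  # no candidate matches at this level; keep the previous set
        candidates, level = matched, a + 1
    if level == 0:
        return None, 0
    return candidates[0][1], level


blocks = [(1, 1, 2), (2, 2, 7), (3, 4, 4), (2, 2, 21), (3, 4, 6),
          (1, 3, 5), (1, 3, 18), (2, 2, 1), (1, 1, 3), (3, 4, 8)]
# Sorting keeps correspondences with equal leading fingerprints contiguous.
structure = sorted((fps, number) for number, fps in enumerate(blocks, start=1))

print(search_by_level(structure, (2, 2, 1)))  # full match on the 8th data block
print(search_by_level(structure, (1, 3, 9)))  # levels 1 and 2 match only
```

A list comprehension is used here for brevity; on a sorted array each narrowing step could instead be done with a binary search over the contiguous run of matching entries.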

By adopting the method of the first mode and the second mode to search the data blocks, the K-level characteristic fingerprints of the first data block do not need to be matched with the characteristic fingerprints of all the second data blocks, and the searching efficiency can be improved.

Optionally, after step 302, the method further includes:

if the target data block is the same as the first data block, storing indication information, wherein the indication information is used for indicating the address of the target data block;

if the target data block is a data block similar to the first data block, performing similar compression on the first data block based on the target data block, and storing compressed data;

and if the target data block is not determined, storing the first data block.

According to this optional method, after the target data block is determined, if the target data block is the same as the first data block, data deduplication can be achieved by storing the indication information. Data deduplication is a lossless redundancy-reduction technique in which only one copy of multiple identical data blocks is kept in the storage system, which reduces the resources needed for data storage and saves cost. If the target data block is a data block similar to the first data block, the first data block is compressed against the target data block (similar compression), which reduces the amount of stored data, improves the compression rate of the data, and saves storage space. If no target data block is determined, which indicates that none of the M second data blocks is the same as or similar to the first data block, the first data block is stored directly.
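The three storage outcomes can be sketched as below. The embodiment does not name a similar-compression algorithm, so this sketch substitutes zlib with the target block as a preset dictionary, which is an assumption rather than the method of the embodiment. The record layout, the byte-wise equality test, and the names `store_block` and `restore_delta` are likewise hypothetical.

```python
import zlib


def store_block(first_block, target_block, target_address, storage):
    """Append one record for `first_block` and return it.

    ("ref", address)          -> identical block already stored (deduplication)
    ("delta", address, data)  -> block compressed against a similar block
    ("raw", block)            -> no target block was determined
    """
    if target_block is None:
        record = ("raw", first_block)
    elif target_block == first_block:
        record = ("ref", target_address)
    else:
        # Stand-in for similar compression: DEFLATE primed with the target block.
        compressor = zlib.compressobj(zdict=target_block)
        delta = compressor.compress(first_block) + compressor.flush()
        record = ("delta", target_address, delta)
    storage.append(record)
    return record


def restore_delta(delta, target_block):
    """Rebuild a block stored as a delta, given the same target block."""
    decompressor = zlib.decompressobj(zdict=target_block)
    return decompressor.decompress(delta) + decompressor.flush()


storage = []
stored = b"header-v1;" + b"payload " * 50
similar = b"header-v2;" + b"payload " * 50
kind, address, delta = store_block(similar, stored, 0, storage)
print(kind, restore_delta(delta, stored) == similar)
```

In the embodiment the distinction between "same" and "similar" would come from the matched fingerprint levels; the byte comparison here only keeps the sketch self-contained.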

Optionally, after step 302, the method further includes: calculating the characteristic fingerprint of the first data block, and adding the correspondence between the characteristic fingerprint of the first data block and the index of the first data block to the joint search structure.
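For a continuous (array-like) joint search structure, adding the new correspondence can be sketched with a sorted insertion, so that entries with equal leading fingerprints stay adjacent. The tuple layout (fingerprints, index) and the name `add_correspondence` are assumptions for illustration.

```python
import bisect


def add_correspondence(structure, fingerprints, index):
    """Insert one (fingerprints, index) pair, keeping the list sorted."""
    bisect.insort(structure, (tuple(fingerprints), index))


structure = [((1, 1, 2), 1), ((2, 2, 7), 2), ((3, 4, 4), 3)]
add_correspondence(structure, [2, 2, 1], 4)
print(structure)
```

Exactly one entry is added per stored block, so the total number of indexes in the structure stays equal to the total number of data blocks.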

The method provided by the embodiment of the invention can search the same data or similar data through the combined search structure, and because the total number of indexes contained in the combined search structure is the same as the total number of the second data blocks and all the characteristic fingerprints of one second data block correspond to only one index of the second data block, compared with the scheme in the prior art, the redundancy of the data can be greatly reduced.

An embodiment of the present invention further provides an apparatus 70 for searching a data block, as shown in fig. 7, including:

an obtaining unit 701, configured to obtain K-level feature fingerprints of a first data block, where the level of the k-th level feature fingerprint in the K-level feature fingerprints is higher than the level of the (k-1)-th level feature fingerprint in the K-level feature fingerprints, K is an integer greater than 0, and k is an integer greater than 0 and less than or equal to K;

a determining unit 702, configured to determine a target index according to the K-level feature fingerprint and a joint search structure, and determine a target data block according to the target index, where the target data block is a second data block with a highest similarity level to the first data block, and the similarity level of the target data block and the first data block is greater than 0, the joint search structure includes M correspondence relationships, where one correspondence relationship is a correspondence relationship between a feature fingerprint of a second data block and an index of the second data block, the number of levels of the feature fingerprint of each second data block is greater than or equal to K, the total number of indexes included in the joint search structure is the same as the total number of second data blocks, and M is an integer greater than 0; wherein the similarity level is used for indicating the correlation degree of the second data block and the first data block.

Optionally, the characteristic fingerprints of the first data block and the second data block include similar characteristic fingerprints, or the characteristic fingerprints of the first data block and the second data block include similar characteristic fingerprints and the same characteristic fingerprint.

Optionally, the determining unit 702 is specifically configured to:

matching the K-level feature fingerprints with feature fingerprints in a first corresponding relation in the combined search structure;

Optionally, the determining unit 702 is specifically configured to:

matching the a-th level feature fingerprint in the K-level feature fingerprints with the a-th level feature fingerprints in the M correspondences in the joint search structure, and determining all correspondences that match the a-th level feature fingerprint in the K-level feature fingerprints; setting a to a + 1, matching the a-th level feature fingerprint in the K-level feature fingerprints with the a-th level feature fingerprints in all the correspondences just determined, and determining all correspondences that match the a-th level feature fingerprint in the K-level feature fingerprints; and repeating this step until the index in the correspondence having the highest matching level with the K-level feature fingerprints is determined as the target index, wherein the initial value of a is 1.

Optionally, as shown in fig. 8, the apparatus 70 further includes a compressed storage unit 703;

the compressed storage unit 703 is configured to store indication information when the target data block is the same data block as the first data block, where the indication information is used to indicate an address of the target data block; or, when the target data block is a data block similar to the first data block, performing similar compression on the first data block based on the target data block, and storing compressed data; or storing the first data block when the target data block is not determined.

Optionally, as shown in fig. 8, the apparatus 70 further includes an execution unit 704;

the executing unit 704 is configured to calculate a feature fingerprint of the first data block, and add a correspondence between the feature fingerprint of the first data block and the index of the first data block to the joint lookup structure.

Each unit in the apparatus 70 provided in the embodiment of the present invention is configured to execute the method, and therefore, beneficial effects of the apparatus 70 may refer to beneficial effects of the method part, which are not described herein again.

An embodiment of the present invention further provides a device 90 for searching a data block, as shown in fig. 9, including: a memory 901 and a processor 902, the memory 901 being adapted to store a set of codes, the processor 902 being adapted to perform the method illustrated in fig. 3 above according to the set of codes.

Each functional unit in the apparatus for searching for a data block may be embedded in a processor of the apparatus for searching for a data block in a hardware form or may be independent of the processor of the apparatus for searching for a data block, or may be stored in a processor of the apparatus for searching for a data block in a software form, so that the processor calls and executes operations corresponding to each unit. The Processor may be a CPU, a general-purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other Programmable logic devices, transistor logic devices, hardware components, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. A processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a DSP and a microprocessor, or the like.

Each component in the apparatus 90 provided by the embodiment of the present invention is configured to perform the method, and therefore, for the beneficial effects of the apparatus 90, refer to the beneficial effects of the method part, which are not described herein again. The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable hard disk, a compact disc Read Only Memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.

The foregoing specific embodiments further describe in detail the objects, technical solutions, and advantages of the present invention. It should be understood that the foregoing are only exemplary embodiments of the present invention and are not intended to limit the scope of the present invention; any modification, equivalent substitution, improvement, or the like made on the basis of the technical solutions of the present invention shall fall within the scope of the present invention.

Claims

1. A method for searching a data block, comprising:

acquiring K-level feature fingerprints of a first data block, wherein the level of the k-th level feature fingerprint in the K-level feature fingerprints is higher than the level of the (k-1)-th level feature fingerprint in the K-level feature fingerprints, K is an integer greater than 0, and k is an integer greater than 0 and less than or equal to K;

determining a target index according to the K-level feature fingerprint and a joint search structure, determining a target data block according to the target index, wherein the target data block is one of M second data blocks corresponding to M indexes in the joint search structure and has the highest similarity level with the first data block, the similarity level of the target data block and the first data block is greater than 0, the joint search structure comprises M corresponding relations, one corresponding relation is the corresponding relation between the feature fingerprint of one second data block and the index of the second data block, the level number of the feature fingerprint of each second data block is greater than or equal to K, the total number of the indexes in the joint search structure is the same as the total number of the second data blocks, and M is an integer greater than 0; wherein the similarity level is used for indicating the correlation degree of the second data block and the first data block.

2. The method of claim 1, wherein the feature fingerprints of the first data block and the second data block comprise similar feature fingerprints, or wherein the feature fingerprints of the first data block and the second data block comprise similar feature fingerprints and identical feature fingerprints.

3. The method of claim 2, wherein determining a target index from the K-level feature fingerprints and joint lookup structure comprises:

4. The method of claim 2, wherein determining a target index from the K-level feature fingerprints and joint lookup structure comprises:

5. The method according to any one of claims 1-4, further comprising:

and if the target data block is not determined, storing the first data block.

6. The method according to any one of claims 1-4, further comprising:

and calculating the feature fingerprint of the first data block, and adding the corresponding relation between the feature fingerprint of the first data block and the index of the first data block into the joint search structure.
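For illustration only, adding a new correspondence to the joint search structure can be sketched as follows. This is a sketch under stated assumptions: the structure is modeled as a dictionary from index to fingerprints, and `toy_fingerprints` is a hypothetical stand-in in which the k-th level fingerprint is the CRC32 of the first k/K of the block, so that the top level covers the whole block. The source does not specify how the fingerprints are computed, and a real similarity fingerprint would differ.

```python
import zlib
from typing import Dict, Tuple


def toy_fingerprints(block: bytes, levels: int) -> Tuple[int, ...]:
    """Hypothetical K-level fingerprints: CRC32 over a growing prefix of the block."""
    return tuple(
        zlib.crc32(block[: len(block) * k // levels])
        for k in range(1, levels + 1)
    )


def add_block(
    joint: Dict[int, Tuple[int, ...]], index: int, block: bytes, levels: int
) -> None:
    """Add one correspondence (fingerprints -> index) for a newly stored block."""
    if index in joint:
        # One index per block keeps the index count equal to the block count.
        raise KeyError(f"index {index} is already present")
    joint[index] = toy_fingerprints(block, levels)


# Usage: two blocks are added, so the structure ends up with two indexes.
structure: Dict[int, Tuple[int, ...]] = {}
add_block(structure, 0, b"abcdefgh", levels=4)
add_block(structure, 1, b"abcdXXXX", levels=4)
```

In this toy scheme the two blocks share their first two fingerprint levels, because the first four bytes are identical, and differ at the higher levels.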

7. An apparatus for searching a block of data, comprising:

an acquisition unit, configured to acquire K-level feature fingerprints of a first data block, wherein the level of a k-th level feature fingerprint in the K-level feature fingerprints is higher than the level of a (k-1)-th level feature fingerprint in the K-level feature fingerprints, K is an integer greater than 0, and k is an integer greater than 0 and less than or equal to K;

a determining unit, configured to determine a target index according to the K-level feature fingerprint and a joint search structure, and determine a target data block according to the target index, where the target data block is a second data block with a highest similarity level to the first data block, and the similarity level of the target data block and the first data block is greater than 0, the joint search structure includes M correspondence relationships, where one correspondence relationship is a correspondence relationship between a feature fingerprint of a second data block and an index of the second data block, the number of levels of the feature fingerprint of each second data block is greater than or equal to K, the total number of indexes included in the joint search structure is the same as the total number of second data blocks, and M is an integer greater than 0; wherein the similarity level is used for indicating the correlation degree of the second data block and the first data block.

8. The apparatus of claim 7, wherein the feature fingerprints of the first data block and the second data block comprise similar feature fingerprints, or wherein the feature fingerprints of the first data block and the second data block comprise similar feature fingerprints and identical feature fingerprints.

9. The apparatus according to claim 8, wherein the determining unit is specifically configured to:

10. The apparatus according to claim 8, wherein the determining unit is specifically configured to:

11. The apparatus according to any one of claims 7-10, further comprising a compressed storage unit;

the compressed storage unit is configured to store indication information when the target data block is a data block that is the same as the first data block, where the indication information is used to indicate an address of the target data block; or, when the target data block is a data block similar to the first data block, performing similar compression on the first data block based on the target data block, and storing compressed data; or storing the first data block when the target data block is not determined.

12. The apparatus according to any of claims 7-10, wherein the apparatus further comprises an execution unit;

the execution unit is configured to calculate a feature fingerprint of the first data block, and add a correspondence between the feature fingerprint of the first data block and the index of the first data block to the joint search structure.

13. An apparatus for searching a block of data, comprising: a memory for storing a set of codes and a processor for performing the following actions in accordance with the set of codes:

14. The apparatus of claim 13, wherein the feature fingerprints of the first data block and the second data block comprise similar feature fingerprints, or wherein the feature fingerprints of the first data block and the second data block comprise similar feature fingerprints and identical feature fingerprints.

15. The apparatus of claim 14, wherein the processor is specifically configured to:

16. The apparatus of claim 14, wherein the processor is specifically configured to:

17. The apparatus according to any of claims 13-16, wherein the processor is further configured to:

when the target data block is the same as the first data block, storing indication information, wherein the indication information is used for indicating the address of the target data block;

when the target data block is a data block similar to the first data block, performing similar compression on the first data block based on the target data block, and storing compressed data;

and when the target data block is not determined, storing the first data block.
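For illustration only, the three storage outcomes recited above can be sketched as follows. This is a sketch under stated assumptions: "same" is decided here by direct byte comparison rather than by fingerprint, and "similar compression" is approximated by DEFLATE with the target data block supplied as a preset dictionary (`zdict` in Python's `zlib`), which is a substitute technique and not necessarily the delta compression intended by the source. The names `store_first_block` and `restore_first_block` and the record tags are hypothetical.

```python
import zlib
from typing import Optional, Tuple, Union

Record = Tuple[str, Union[int, bytes]]


def store_first_block(
    first: bytes, target: Optional[bytes], target_address: Optional[int]
) -> Record:
    """Choose how to store the first data block given the lookup result."""
    if target is None:
        # No target data block was determined: store the block itself.
        return ("raw", first)
    if target == first and target_address is not None:
        # Identical block: store only indication information (its address).
        return ("ref", target_address)
    # Similar block: compress the first block against the target block.
    compressor = zlib.compressobj(zdict=target)
    return ("delta", compressor.compress(first) + compressor.flush())


def restore_first_block(record: Record, target: Optional[bytes]) -> bytes:
    """Invert store_first_block; the target block is needed for 'ref' and 'delta'."""
    kind, payload = record
    if kind == "raw":
        return payload
    if kind == "ref":
        return target
    decompressor = zlib.decompressobj(zdict=target)
    return decompressor.decompress(payload) + decompressor.flush()


# Usage: a block that differs from the target by one trailing byte.
target_block = b"hello world " * 20
first_block = target_block + b"!"
record = store_first_block(first_block, target_block, target_address=7)
```

The preset dictionary lets the compressor encode the shared content as back-references into the target data block, so the stored record is much smaller than the first data block when the two are similar.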

18. The apparatus according to any of claims 13-16, wherein the processor is further configured to: