WO2017020735A1 - Data processing method, backup server, and storage system - Google Patents

Data processing method, backup server, and storage system

Info

Publication number
WO2017020735A1
Authority
WO
WIPO (PCT)
Prior art keywords
fingerprint
index
probability
stored
data block
Prior art date
Application number
PCT/CN2016/091054
Inventor
吴晨涛
黄洵松
薛常亮
王元钢
Original Assignee
华为技术有限公司
上海交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 and 上海交通大学
Publication of WO2017020735A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1666 Error detection or correction of the data by redundancy in hardware where the redundant component is memory or memory area
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Definitions

  • the present invention relates to the field of computer technologies, and in particular, to a data processing method, a backup server, and a storage system.
  • deduplication is a key technology for saving data storage space. It detects and eliminates redundant data so that only one copy of identical data is kept, which saves disk space, improves data write performance, and saves network bandwidth. It is widely used in file backup, online storage services, email services, and other fields.
  • a fingerprint table is stored in the storage system, and the fingerprint of the data stored in the storage system is stored in the fingerprint table.
  • the storage system compares the fingerprint of the data block to be stored with the fingerprint in the fingerprint table to determine whether the data block to be stored is duplicate data, and further determines a storage manner of the data block to be stored.
  • the fingerprints in the fingerprint table need to be read into the memory of the backup server. Because the fingerprint table stores a large number of fingerprints, comparing against all fingerprints in the table moves a large amount of data, which consumes a large amount of input/output (I/O) resources and takes a long time, so the storage system stores data inefficiently.
  • the embodiment of the invention provides a data processing method, a backup server and a storage system, which are used to solve the problem that the efficiency of data storage is low due to the consumption of a large amount of I/O resources by fingerprint comparison.
  • an embodiment of the present invention provides a data processing method, where the method is performed by a backup server in a storage system, where the storage system includes the backup server and a plurality of memories.
  • a plurality of fingerprint tables are stored in the storage system, and the fingerprints of the data blocks that have been stored in the plurality of memories are recorded in the plurality of fingerprint tables, and the method includes:
  • determining a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of a data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents a plurality of fingerprints in a first fingerprint table, the second index fingerprint represents a plurality of fingerprints in a second fingerprint table, and the fingerprint of the data block to be stored falls within both the fingerprint range of the plurality of fingerprints represented by the first index fingerprint and the fingerprint range of the plurality of fingerprints represented by the second index fingerprint;
  • obtaining, according to the first index fingerprint, a first probability that the first fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored, and obtaining, according to the second index fingerprint, a second probability that the second fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the plurality of fingerprints represented by the first index fingerprint, and the second probability is determined according to the plurality of fingerprints represented by the second index fingerprint;
  • the first fingerprint table is stored in a first memory of the plurality of memories, and the second fingerprint table is stored in a second memory of the plurality of memories;
  • obtaining, according to the first index fingerprint, the first probability that the first fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored includes:
  • obtaining, according to the second index fingerprint, the second probability that the second fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored includes:
  • the backup server includes an auxiliary memory, where the first fingerprint table and the second fingerprint table are stored in the auxiliary memory;
  • receiving, from the auxiliary memory, the first probability that the plurality of fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored, and the second probability that the plurality of fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored.
  • Each fingerprint in the first fingerprint table includes M bits, and each M-bit fingerprint includes N intervals, each of the N intervals includes consecutive S bits in the M bits, and any two of the N intervals The intervals do not overlap, the sum of the number of bits of the N intervals is M, N is a natural number greater than or equal to 2, and S is a natural number;
  • a first statistics table is stored in the storage system, where the first statistics table includes statistical information on the values that the plurality of fingerprints represented by the first index fingerprint take in the N intervals, and determining the first probability includes:
  • the first probability is determined based on the minimum of the obtained frequencies t_1 to t_N.
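The interval statistics described above can be illustrated with a small sketch. It assumes equal-width intervals (S = M / N), treats interval 1 as the most significant bits, and turns the minimum frequency into a probability by dividing by the number of fingerprints in the group. The text only says the probability is determined from the minimum of t_1 to t_N, so the exact ratio and all function names here are assumptions.

```python
from collections import Counter

def split_intervals(fp: int, m_bits: int, n: int) -> list[int]:
    """Split an M-bit fingerprint into N non-overlapping S-bit intervals (S = M / N)."""
    if m_bits % n != 0:
        raise ValueError("this sketch assumes N divides M")
    s = m_bits // n
    mask = (1 << s) - 1
    # Interval 1 is taken as the most significant S bits (an assumed convention).
    return [(fp >> (m_bits - s * (i + 1))) & mask for i in range(n)]

def build_stats(fingerprints, m_bits: int, n: int) -> list[Counter]:
    """Statistics table: one Counter per interval, counting each value seen there."""
    stats = [Counter() for _ in range(n)]
    for fp in fingerprints:
        for i, value in enumerate(split_intervals(fp, m_bits, n)):
            stats[i][value] += 1
    return stats

def match_probability(stats, group_size: int, fp: int, m_bits: int, n: int) -> float:
    """Estimate from min(t_1..t_N); dividing by group_size is an assumption."""
    if group_size == 0:
        return 0.0
    # t_i = how often the query's i-th interval value a_i occurs in the group.
    t = [stats[i][a_i] for i, a_i in enumerate(split_intervals(fp, m_bits, n))]
    return min(t) / group_size
```

A fingerprint identical to the query must contribute to the count of a_i in every interval, so the minimum frequency bounds the number of possible matches from above; a minimum of zero proves the fingerprint is absent, while a non-zero minimum can still be a false positive.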
  • a first statistics table is stored in the storage system, where the first statistics table includes statistical information on the values of a first interval of the plurality of fingerprints represented by the first index fingerprint and statistical information on the values of a second interval of the plurality of fingerprints represented by the first index fingerprint; the first interval runs from the h-th bit to the i-th bit of each fingerprint, and the second interval runs from the j-th bit to the k-th bit of each fingerprint.
  • the first probability is determined as follows:
  • the first probability is determined based on the minimum of t_1 and t_2.
  • an embodiment of the present invention provides a backup server, where the backup server is applied to a storage system, where the storage system includes the backup server and a plurality of memories, and the storage system stores multiple fingerprint tables.
  • a fingerprint of a data block that has been stored in the plurality of memories is recorded in the plurality of fingerprint tables, and the backup server includes:
  • a determining module, configured to determine a first fingerprint set according to the index fingerprints in the fingerprint index table and the fingerprint of the data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents a plurality of fingerprints in the first fingerprint table, the second index fingerprint represents a plurality of fingerprints in the second fingerprint table, and the fingerprint of the data block to be stored falls within both the fingerprint range of the plurality of fingerprints represented by the first index fingerprint and the fingerprint range of the plurality of fingerprints represented by the second index fingerprint;
  • an obtaining module, configured to obtain, according to the first index fingerprint, a first probability that the first fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored, and to obtain, according to the second index fingerprint, a second probability that the second fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the plurality of fingerprints represented by the first index fingerprint, and the second probability is determined according to the plurality of fingerprints represented by the second index fingerprint;
  • the determining module is further configured to determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold;
  • a processing module configured to obtain a matching result of the plurality of fingerprints represented by the first index fingerprint and the fingerprint of the to-be-stored data block.
  • the first fingerprint table is stored in a first memory of the plurality of memories, and the second fingerprint table is stored in a second memory of the plurality of memories;
  • the obtaining module is specifically configured to: send the fingerprint of the data block to be stored and the first index fingerprint to the first memory; and receive the first probability returned by the first memory, where the first probability indicates the probability that the plurality of fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored;
  • send the fingerprint of the data block to be stored and the second index fingerprint to the second memory; and receive the second probability returned by the second memory, where the second probability indicates the probability that the plurality of fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored.
  • the backup server further includes:
  • an auxiliary memory, configured to store the first fingerprint table and the second fingerprint table;
  • the obtaining module is specifically configured to: send the fingerprint of the data block to be stored, the first index fingerprint, and the second index fingerprint to the auxiliary memory; and receive, from the auxiliary memory, the first probability that the plurality of fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored, and the second probability that the plurality of fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored.
  • each fingerprint in the first fingerprint table includes M bits, and each M-bit fingerprint includes N intervals; each of the N intervals includes S consecutive bits of the M bits, no two of the N intervals overlap, the numbers of bits in the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number;
  • the auxiliary memory is further configured to store a first statistics table, where the first statistics table includes statistical information on the values of the N intervals of the plurality of fingerprints represented by the first index fingerprint;
  • the auxiliary memory is further configured to: determine, based on the first statistics table, the frequency t_i with which the value a_i occurs in the i-th interval of the plurality of fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N; and determine the first probability according to the minimum of t_1 to t_N.
  • the auxiliary memory is further configured to store a first statistics table, where the first statistics table includes statistical information on the values of a first interval of the plurality of fingerprints represented by the first index fingerprint and statistical information on the values of a second interval of the plurality of fingerprints represented by the first index fingerprint; the first interval runs from the h-th bit to the i-th bit of each fingerprint, and the second interval runs from the j-th bit to the k-th bit of each fingerprint, where h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
  • the auxiliary memory is further configured to: determine, according to the first statistics table, the frequency t_1 with which a occurs among the values of the first interval of the plurality of fingerprints represented by the first index fingerprint, and the frequency t_2 with which b occurs among the values of the second interval of the plurality of fingerprints represented by the first index fingerprint, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored, and b is the value of the j-th to k-th bits of the fingerprint of the data block to be stored; and determine the first probability based on the minimum of t_1 and t_2.
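The two-interval variant can be sketched in the same spirit. Bits are numbered from 1 with bit 1 as the most significant, and the probability is taken as min(t_1, t_2) divided by the group size. Both conventions, and the names `bit_range` and `TwoIntervalStats`, are illustrative assumptions rather than anything the text specifies.

```python
from collections import Counter

def bit_range(fp: int, m_bits: int, lo: int, hi: int) -> int:
    """Value of bits lo..hi of an M-bit fingerprint (1-indexed, inclusive)."""
    if not 1 <= lo <= hi <= m_bits:
        raise ValueError("need 1 <= lo <= hi <= m_bits")
    width = hi - lo + 1
    # Bit 1 is assumed to be the most significant bit.
    return (fp >> (m_bits - hi)) & ((1 << width) - 1)

class TwoIntervalStats:
    """Statistics table over bits h..i (first interval) and j..k (second interval)."""

    def __init__(self, fingerprints, m_bits: int, h: int, i: int, j: int, k: int):
        if not (i < j or k < h):
            raise ValueError("the two intervals must not overlap")
        fingerprints = list(fingerprints)
        self.m_bits, self.first, self.second = m_bits, (h, i), (j, k)
        self.count = len(fingerprints)
        self.t_first = Counter(bit_range(fp, m_bits, h, i) for fp in fingerprints)
        self.t_second = Counter(bit_range(fp, m_bits, j, k) for fp in fingerprints)

    def probability(self, fp: int) -> float:
        """min(t_1, t_2) / count; the normalisation is an assumption."""
        if self.count == 0:
            return 0.0
        a = bit_range(fp, self.m_bits, *self.first)    # value of bits h..i
        b = bit_range(fp, self.m_bits, *self.second)   # value of bits j..k
        return min(self.t_first[a], self.t_second[b]) / self.count
```

Keeping only two short intervals makes the statistics table small, at the cost of more false positives than statistics over all N intervals would give.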
  • an embodiment of the present invention provides a storage system, including a backup server and a plurality of memories, where the storage system stores a plurality of fingerprint tables, and the fingerprints of the data blocks that have been stored in the plurality of memories are recorded in the plurality of fingerprint tables.
  • the backup server is used to:
  • determine a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of a data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents a plurality of fingerprints in a first fingerprint table, the second index fingerprint represents a plurality of fingerprints in a second fingerprint table, and the fingerprint of the data block to be stored falls within both the fingerprint range of the plurality of fingerprints represented by the first index fingerprint and the fingerprint range of the plurality of fingerprints represented by the second index fingerprint;
  • obtain, according to the first index fingerprint, a first probability that the first fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored, and obtain, according to the second index fingerprint, a second probability that the second fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the plurality of fingerprints represented by the first index fingerprint, and the second probability is determined according to the plurality of fingerprints represented by the second index fingerprint;
  • the first fingerprint table is stored in a first memory of the plurality of memories, and the second fingerprint table is stored in a second memory of the plurality of memories; the backup server is specifically used to:
  • send the fingerprint of the data block to be stored together with the first index fingerprint to the first memory and together with the second index fingerprint to the second memory, and receive the first probability returned by the first memory and the second probability returned by the second memory, where the second probability indicates the probability that the plurality of fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored;
  • the first memory is specifically configured to: receive the first index fingerprint and the fingerprint of the data block to be stored sent by the backup server, determine the first probability that the plurality of fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored, and send the first probability to the backup server;
  • the second memory is specifically configured to: receive the second index fingerprint and the fingerprint of the data block to be stored sent by the backup server, determine the second probability that the plurality of fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the data block to be stored, and send the second probability to the backup server.
  • each fingerprint in the first fingerprint table includes M bits, and each M-bit fingerprint includes N intervals; each of the N intervals includes S consecutive bits of the M bits, no two of the N intervals overlap, the numbers of bits in the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number;
  • the first memory stores a first statistics table, where the first statistics table includes statistical information on the values of the N intervals of the plurality of fingerprints represented by the first index fingerprint;
  • the first memory is configured to: determine, based on the first statistics table, the frequency t_i with which the value a_i occurs in the i-th interval of the plurality of fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored, and i ranges from 1 to N;
  • the first probability is determined according to the minimum of t_1 to t_N.
  • the first statistics table is stored in the first memory, where the first statistics table includes statistical information on the values of a first interval of the plurality of fingerprints represented by the first index fingerprint and statistical information on the values of a second interval of the plurality of fingerprints represented by the first index fingerprint; the first interval runs from the h-th bit to the i-th bit of each fingerprint, and the second interval runs from the j-th bit to the k-th bit of each fingerprint, where h, i, j, and k are natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
  • the first memory is specifically configured to:
  • the first probability is determined based on the minimum of t_1 and t_2.
  • the backup server first determines, according to the index fingerprints in the fingerprint index table and the fingerprint of the data block to be stored, a first fingerprint set whose index fingerprints represent fingerprints that may include the fingerprint of the data block to be stored. The backup server then obtains, according to the first index fingerprint in the first fingerprint set, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and, according to the second index fingerprint in the first fingerprint set, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored.
  • the backup server then determines a second fingerprint set according to the first probability and the second probability, and matches the plurality of fingerprints represented by each index fingerprint in the second fingerprint set against the fingerprint of the data block to be stored to obtain a matching result.
  • with the data processing method provided by the embodiment of the present invention, fingerprint matching compares the fingerprint of the data block to be stored only with the fingerprints represented by the index fingerprints in the second fingerprint set, rather than with every fingerprint in the fingerprint library, which reduces the amount of data moved during fingerprint comparison and improves the efficiency of data processing.
  • FIG. 1 is a schematic structural diagram of a storage system according to an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a fingerprint table according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a data processing method according to an embodiment of the present invention.
  • FIG. 4 is a schematic block diagram showing the structure of a storage system according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a fingerprint index table according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of an index fingerprint of a fingerprint table according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural block diagram of still another storage system according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic diagram of an implementation manner of a statistics table according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of another implementation manner of a statistics table according to an embodiment of the present disclosure.
  • FIG. 10 is a schematic block diagram showing the structure of a backup server according to an embodiment of the present invention.
  • the present invention provides a data processing method, a backup server, and a storage system to solve the problem that a storage system consumes a large amount of I/O resources during fingerprint matching, which makes data storage inefficient.
  • FIG. 1 is a schematic structural diagram of a storage system according to an embodiment of the present invention.
  • the storage system 10 includes a memory 12 for storing data and a backup server 11 for determining whether a data block to be stored is duplicate data and for scheduling the memory 12 to store data blocks.
  • the storage system 10 stores a fingerprint (FP) table. FIG. 2 is a schematic diagram of a fingerprint table, which records the fingerprint of each piece of data stored in the memory 12 together with the storage location of that data.
  • after receiving a request to store data, the backup server 11 compares the fingerprint of the data block to be stored with the fingerprints in the fingerprint table. If a fingerprint identical to that of the data block to be stored is found in the fingerprint table, the data block is duplicate data: the storage system does not need to store it and only updates its reference relationship. Otherwise, if no identical fingerprint is found in the fingerprint table, the data block is new data, and the storage system allocates storage space to store it.
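The baseline flow just described (look up the fingerprint, then either update a reference or allocate space) can be sketched as follows. This is only a minimal illustration: the use of SHA-256 as the fingerprint function, the in-memory dictionary standing in for the fingerprint table, and the class name `DedupStore` are all assumptions.

```python
import hashlib

class DedupStore:
    """Toy single-table deduplicating store."""

    def __init__(self):
        self.fingerprint_table = {}  # fingerprint -> storage location
        self.ref_counts = {}         # storage location -> number of references
        self.blocks = []             # stand-in for the memory that holds data

    def store(self, block: bytes) -> tuple[int, bool]:
        """Return (storage location, was_duplicate)."""
        # SHA-256 is an assumed fingerprint function; the text does not name one.
        fp = hashlib.sha256(block).hexdigest()
        location = self.fingerprint_table.get(fp)
        if location is not None:
            # Duplicate data: keep the single stored copy, update the reference.
            self.ref_counts[location] += 1
            return location, True
        # New data: allocate space and record its fingerprint and location.
        location = len(self.blocks)
        self.blocks.append(block)
        self.fingerprint_table[fp] = location
        self.ref_counts[location] = 1
        return location, False
```

With one small in-memory table the lookup is cheap; the method in this document targets the case where the fingerprint tables are too large to compare against in full.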
  • FIG. 3 is a schematic flowchart of a data processing method according to an embodiment of the present disclosure, where the method may include:
  • Step 101: Determine a first fingerprint set according to the index fingerprints in the fingerprint index table and the fingerprint of the data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents a plurality of fingerprints in the first fingerprint table, the second index fingerprint represents a plurality of fingerprints in the second fingerprint table, and the fingerprint of the data block to be stored falls within both the fingerprint range of the plurality of fingerprints represented by the first index fingerprint and the fingerprint range of the plurality of fingerprints represented by the second index fingerprint;
  • Step 102: Obtain, according to the first index fingerprint, a first probability that the first fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored, and obtain, according to the second index fingerprint, a second probability that the second fingerprint table includes a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the plurality of fingerprints represented by the first index fingerprint, and the second probability is determined according to the plurality of fingerprints represented by the second index fingerprint;
  • Step 103: Determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold;
  • Step 104: Obtain a matching result of the plurality of fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored.
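Steps 101 to 104 can be read as a filter pipeline. The sketch below assumes each fingerprint table is a sorted list of groups whose first element serves as the index fingerprint, that the last group of a table is open-ended, and that the probability estimator is passed in as a function. `low_byte_estimate` is a made-up placeholder estimator, not the statistic described in this document.

```python
from bisect import bisect_right

def candidate_groups(fp, tables):
    """Step 101: per table, pick the group whose range [index_fp, next_index_fp) holds fp."""
    first_set = []
    for table_id, groups in tables.items():
        index_fps = [group[0] for group in groups]
        pos = bisect_right(index_fps, fp) - 1
        if pos >= 0:  # fp is not below the table's smallest index fingerprint
            first_set.append((table_id, groups[pos]))
    return first_set

def low_byte_estimate(fp, group):
    """Placeholder estimator (assumption): share of the group with the same low byte."""
    return sum((g & 0xFF) == (fp & 0xFF) for g in group) / len(group)

def lookup(fp, tables, estimate, threshold):
    """Return the ids of the tables in which fp was actually found."""
    first_set = candidate_groups(fp, tables)                            # step 101
    scored = [(tid, g, estimate(fp, g)) for tid, g in first_set]        # step 102
    second_set = [(tid, g) for tid, g, p in scored if p >= threshold]   # step 103
    return [tid for tid, g in second_set if fp in g]                    # step 104
```

Only the groups that survive step 103 are scanned in step 104, which is where the reduction in compared fingerprints comes from.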
  • the data processing methods corresponding to the steps 101 to 104 may have at least two implementation manners, which are respectively introduced below.
  • FIG. 4 is a schematic structural diagram of a storage system 20 corresponding to Embodiment 1, and the storage system 20 includes a backup server 21 and a plurality of memories 22. In the first embodiment, steps 101 to 104 are executed by the backup server 21.
  • the backup server 21 includes a processor 211, a memory 212, and an auxiliary memory 213.
  • the auxiliary memory 213 includes a processing unit 214. Therefore, the auxiliary memory 213 has computing power.
  • the fingerprint table in the storage system is stored in the auxiliary memory 213. In practice, there may be one, two, or more auxiliary memories 213.
  • the backup server 21 stores a fingerprint index table, which may be stored in the memory 212, in the auxiliary memory 213, or in another storage unit of the backup server 21.
  • FIG. 5 is a schematic diagram of a fingerprint index table. The fingerprint index table stores at least two index fingerprints for each fingerprint table; each index fingerprint represents multiple fingerprints in one fingerprint table, and the fingerprints represented by all index fingerprints of a fingerprint table together make up all the fingerprints in that table.
  • In FIG. 5, Ti denotes the i-th fingerprint table and Li denotes the i-th index fingerprint of a fingerprint table.
  • FP_14 is the second index fingerprint of the first fingerprint table and represents the fingerprints from FP_14 up to but excluding FP_17 in the first fingerprint table, namely FP_14, FP_15, and FP_16.
  • FIG. 6 is a schematic diagram of the index fingerprint corresponding to the fingerprint table 1.
  • the fingerprints FP_1 to FP_9 in FIG. 6 increase in order, and the fingerprints of fingerprint table 1 can be divided into three parts: FP_1 to FP_3, FP_4 to FP_6, and FP_7 to FP_9.
  • FP_1, FP_4, and FP_7 are used as the three index fingerprints of the fingerprint table and are included in the index fingerprint table: FP_1 in the index fingerprint table represents FP_1 to FP_3 in the fingerprint table, FP_4 represents FP_4 to FP_6, and FP_7 represents FP_7 to FP_9. Because FP_1 to FP_9 increase in order, the fingerprint ranges of the fingerprints represented by FP_1, FP_4, and FP_7 in the index fingerprint table do not overlap.
  • the fingerprint range of the plurality of fingerprints represented by each index fingerprint may be determined from the index fingerprint table. In one implementation, each index fingerprint is the first of the plurality of fingerprints it represents in the fingerprint table, so the fingerprint range of the fingerprints represented by an index fingerprint can be determined from two adjacent index fingerprints of the same fingerprint table. Taking fingerprint table 1 of FIG. 6 as an example, from the adjacent index fingerprints FP_1 and FP_4, the fingerprint range of the fingerprints represented by FP_1 can be determined as [FP_1, FP_4).
  • the index fingerprint table includes a fingerprint range attribute, and the fingerprint range of the plurality of fingerprints represented by the index fingerprint is saved for each index fingerprint.
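As an illustration of the FIG. 6 example, the sketch below derives index fingerprints from a sorted fingerprint table and computes the half-open range each one covers. The fixed group size and the open-ended upper bound of the last range (shown as `None`) are assumptions; the text does not say how the last range is bounded.

```python
def build_index(fingerprints, group_size):
    """Index fingerprints: the first fingerprint of each consecutive sorted group."""
    ordered = sorted(fingerprints)
    return ordered[::group_size]

def fingerprint_ranges(index_fps):
    """Half-open ranges [index_fp, next_index_fp) derived from adjacent index fingerprints."""
    # The last index fingerprint has no successor, so its upper bound is None (assumed open-ended).
    uppers = list(index_fps[1:]) + [None]
    return list(zip(index_fps, uppers))

# Integers 1..9 stand in for FP_1..FP_9 of fingerprint table 1.
table_1 = list(range(1, 10))
index_1 = build_index(table_1, 3)
ranges_1 = fingerprint_ranges(index_1)
```

Deriving ranges from adjacent index fingerprints needs no extra storage, whereas a stored fingerprint range attribute trades a little space for not having to consult the neighbouring entry.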
  • in step 101, the processor 211 of the backup server 21 determines, according to the value of the fingerprint of the data block to be stored, the index fingerprints in the fingerprint index table whose represented fingerprint ranges contain the fingerprint of the data block to be stored; the set of index fingerprints so determined is the first fingerprint set.
  • the embodiments of the present invention use the first fingerprint table and the second fingerprint table as an example; this does not mean that the storage system contains only these two fingerprint tables, nor that step 101 is performed only on the index fingerprints of these two tables in the fingerprint index table.
  • step 101 is performed on the index fingerprints of each fingerprint table. Because the fingerprint ranges of the index fingerprints of a fingerprint table do not overlap, at most one index fingerprint can be determined for each fingerprint table.
  • step 102 is then executed: the backup server 21 determines the probability that each fingerprint table contains a fingerprint identical to that of the data block to be stored. Step 101 has already determined, for each fingerprint table, the index fingerprint whose fingerprint range contains the fingerprint of the data block to be stored, so the fingerprints represented by the other index fingerprints of that table cannot contain it. The probability that the first fingerprint table contains a fingerprint identical to that of the data block to be stored is therefore, in substance, the probability that the plurality of fingerprints represented by the first index fingerprint determined in step 101 include such a fingerprint.
  • likewise, the probability that the second fingerprint table contains a fingerprint identical to that of the data block to be stored is, in substance, the probability that the plurality of fingerprints represented by the second index fingerprint determined in step 101 include such a fingerprint.
  • The processor 211 sends the fingerprint of the data block to be stored and the index fingerprints in the first fingerprint set to the auxiliary memory 213. The auxiliary memory 213 determines, through its processing unit 214,
  • the probability of retrieving the fingerprint of the data block to be stored among the plurality of fingerprints represented by each index fingerprint in the first fingerprint set, and then returns the determined probability values to the processor 211.
  • Specifically, the auxiliary memory 213 may determine the probability that the fingerprints represented by a received index fingerprint contain the same fingerprint as the fingerprint of the data block to be stored according to statistical information about those fingerprints (for example, the distribution of fingerprints near the value of the fingerprint of the data block to be stored, or the frequency statistics of fingerprint values in each fingerprint dimension).
  • The processor may transmit to the auxiliary memory 213 only those index fingerprints in the first fingerprint set that are associated with the fingerprint table held by that auxiliary memory.
  • For example, the backup server 21 includes a first auxiliary storage and a second auxiliary storage, where the first auxiliary storage stores the first fingerprint table and the second auxiliary storage stores the second fingerprint table. The processor 211 sends the fingerprint of the data block to be stored and the first index fingerprint to the first auxiliary storage, and sends the fingerprint of the data block to be stored and the second index fingerprint to the second auxiliary storage.
  • the first auxiliary memory determines the first probability by its processing unit and returns to the processor 211; the second auxiliary memory determines the second probability by its processing unit and returns to the processor 211.
  • In step 103, the backup server 21 determines a second fingerprint set according to the received probabilities of retrieving the fingerprint of the data block to be stored from each fingerprint table. The second fingerprint set includes the index fingerprints in the first fingerprint set that satisfy a preset condition.
  • The preset condition is that the probability of retrieving the fingerprint of the data block to be stored among the plurality of fingerprints represented by the index fingerprint (that is, the probability, determined in step 102, that the fingerprint table storing the plurality of fingerprints represented by the index fingerprint
  • includes the same fingerprint as the fingerprint of the data block to be stored) is greater than a preset threshold.
  • The value of the preset threshold may be 0. In that case, some index fingerprints are removed from the first fingerprint set to form the second fingerprint set, and the removed index fingerprints are those for which the probability of retrieving the fingerprint of the data block to be stored among the plurality of fingerprints they represent is 0.
  • For example, if the second probability that the auxiliary memory 213 returns to the processor 211, namely the probability of retrieving the fingerprint of the data block to be stored among the plurality of fingerprints represented by the second index fingerprint, is less than the preset threshold, and the first probability of retrieving the fingerprint of the data block to be stored among the plurality of fingerprints represented by the first index fingerprint is greater than the preset threshold, then the first index fingerprint is included in the second fingerprint set and the second index fingerprint is not.
  • the backup server 21 performs fingerprint comparison on the plurality of fingerprints represented by the index fingerprints in the second fingerprint set to obtain a fingerprint comparison result, thereby determining whether the data block to be stored is duplicate data.
  • In one implementation, the processor 211 reads the plurality of fingerprints represented by the index fingerprints in the second fingerprint set from the fingerprint table saved in the auxiliary memory 213 into the memory 212, and then compares the fingerprints in the memory 212.
  • In another implementation, the auxiliary memory 213 performs the fingerprint comparison through the processing unit 214 that it includes. That is, the auxiliary memory 213, through its processing unit 214, reads the plurality of fingerprints represented by the index fingerprints in the second fingerprint set from the fingerprint table that it stores
  • into the buffer of the auxiliary memory 213 for fingerprint matching.
  • The buffer of the auxiliary memory 213 may be a random access memory (RAM) or a cache.
  • The preset threshold in step 103 may also be a probability value greater than 0. The backup server 21 first performs fingerprint comparison among the plurality of fingerprints represented by the index fingerprints in the second fingerprint set determined according to the preset threshold greater than 0.
  • If the comparison result cannot determine whether the data block to be stored is duplicate data, the backup server 21 then performs fingerprint comparison among the plurality of fingerprints represented by the index fingerprints whose probability values lie in the interval (0, preset threshold). That is, the backup server first performs fingerprint comparison among the fingerprints represented by the index fingerprints with a high probability, and only when that comparison cannot confirm whether the data block to be stored is a duplicate does it compare against the fingerprints represented by the index fingerprints with a lower probability. This can effectively reduce the time consumed by fingerprint comparison.
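  • The two-phase comparison described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: `two_phase_lookup` and `load_fingerprints` are hypothetical names, fingerprints are modeled as integers, and index fingerprints whose probability equals the threshold are placed in the second phase as an assumption, since the text only gives the open interval (0, preset threshold).

```python
from typing import Callable, Dict, Iterable


def two_phase_lookup(fp: int,
                     probabilities: Dict[str, float],
                     load_fingerprints: Callable[[str], Iterable[int]],
                     threshold: float) -> bool:
    """Return True if fp is found, searching high-probability groups first.

    probabilities maps each index fingerprint of the first fingerprint set to
    the probability obtained in step 102. load_fingerprints stands in for
    reading the fingerprints represented by one index fingerprint.
    """
    # Phase 1: index fingerprints above the threshold (the second fingerprint set).
    high = [idx for idx, p in probabilities.items() if p > threshold]
    # Phase 2: remaining index fingerprints with a non-zero probability.
    low = [idx for idx, p in probabilities.items() if 0 < p <= threshold]
    for group in (high, low):
        for idx in group:
            if fp in load_fingerprints(idx):
                return True  # duplicate data; later groups are never loaded
    return False  # index fingerprints with probability 0 are never loaded


# Hypothetical data: three index fingerprints and the fingerprints they represent.
stored = {"idx_a": {0x11, 0x22}, "idx_b": {0x33}, "idx_c": {0x44}}
probs = {"idx_a": 0.9, "idx_b": 0.2, "idx_c": 0.0}
loaded = []


def load(idx: str):
    loaded.append(idx)  # record which groups were actually read
    return stored[idx]


is_duplicate = two_phase_lookup(0x33, probs, load, threshold=0.5)
```

  • In this example the group behind `idx_c` is never read, because its probability is 0, and the group behind `idx_b` is read only after the high-probability group `idx_a` has failed to produce a match.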
  • In another implementation, the value of the preset threshold in step 103 is 0, and every index fingerprint whose corresponding probability value is greater than 0 is determined to be an element of the second fingerprint set. The backup server 21 can then sort
  • the index fingerprints in the second fingerprint set according to their corresponding probability values.
  • When the processor 211 reads the plurality of fingerprints represented by the index fingerprints in the second fingerprint set into the memory for fingerprint retrieval, the order of reading is determined according to the ordering of the probability values of the index fingerprints. That is, the plurality of fingerprints represented by the index fingerprint with the largest probability value are read first. If it cannot be determined from these fingerprints whether the data block to be stored is duplicate data,
  • the fingerprints represented by the index fingerprint whose probability value ranks second are read into the memory for fingerprint comparison, and so on, until the same fingerprint as the fingerprint of the data block to be stored is retrieved, or until it is determined that none of the fingerprints represented by the index fingerprints whose probability values are greater than 0 is the same as the fingerprint of the data block to be stored.
  • The former case indicates that the data block to be stored is duplicate data, and the latter case indicates that the data block to be stored is new data.
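  • The probability-ordered reading described above can be sketched as follows, under the assumption that fingerprints are integers and that a callback stands in for reading one group of fingerprints into memory. The names `ordered_lookup` and `load_fingerprints` are hypothetical.

```python
from typing import Callable, Dict, Iterable


def ordered_lookup(fp: int,
                   probabilities: Dict[str, float],
                   load_fingerprints: Callable[[str], Iterable[int]]) -> str:
    """Compare fp against fingerprint groups in descending order of probability.

    Returns "duplicate" as soon as fp is found, or "new" once every group with
    a probability greater than 0 has been searched without a match.
    """
    # Second fingerprint set with a threshold of 0, sorted by probability.
    candidates = sorted(
        (idx for idx, p in probabilities.items() if p > 0),
        key=lambda idx: probabilities[idx],
        reverse=True,
    )
    for idx in candidates:
        if fp in load_fingerprints(idx):
            return "duplicate"
    return "new"


# Hypothetical data; the probability values here are illustrative only.
stored = {"idx_a": {0x11}, "idx_b": {0x22}, "idx_c": {0x33}}
probs = {"idx_a": 1, "idx_b": 2, "idx_c": 0}
read_order = []


def load(idx: str):
    read_order.append(idx)  # record the order in which groups are read
    return stored[idx]


result = ordered_lookup(0x11, probs, load)
```

  • Here `idx_b` is read before `idx_a` because its probability value is larger, and `idx_c` is never read because its probability value is 0.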
  • The backup server 21 determines, by using the fingerprint index table, the plurality of fingerprints represented by one index fingerprint in each fingerprint table, where the fingerprint range of the plurality of fingerprints represented by that index fingerprint includes the fingerprint of the data block to be stored. The fingerprint matching operation is then performed only on the plurality of fingerprints determined from each fingerprint table, thereby reducing the workload of moving fingerprints.
  • Further, the probability of retrieving the fingerprint of the data block to be stored among the determined plurality of fingerprints is calculated for each fingerprint table, and fingerprint comparison is then performed only among the determined fingerprints of the fingerprint tables whose probability is not less than the preset threshold.
  • This reduces the number of fingerprints moved during fingerprint comparison, reduces the time consumed by fingerprint comparison, and improves the efficiency of storing data.
  • FIG. 7 is a schematic structural diagram of a storage system 30 corresponding to Embodiment 2, and the storage system 30 includes a backup server 31 and a plurality of memories 32. In the second embodiment, steps 101 to 104 are executed by the backup server 31.
  • the memory 32 is configured to store a data block and a fingerprint table.
  • the fingerprint table may be in the form of a fingerprint table.
  • The fingerprint table stored in the memory 32 may be formed by fingerprint information corresponding to the data blocks stored in that memory.
  • The fingerprint table stored in the memory 32 may also hold fingerprint information of other data blocks that are independent of the data blocks stored by the memory 32 itself.
  • the backup server 31 includes a processor 311 and a memory 312.
  • the backup server 31 stores a fingerprint index table.
  • the fingerprint index table may be stored in the memory 312 or may be stored in other storage units of the backup server 31.
  • the manner in which the processor 311 determines the first fingerprint set is the same as the manner in which the processor 211 determines the first fingerprint set, which is not repeated in the embodiment of the present invention.
  • the backup server 31 performs step 102 to obtain a probability of retrieving the fingerprint of the data block to be stored in each fingerprint table (the corresponding plurality of fingerprints represented by the index fingerprints belonging to the first fingerprint set).
  • There are two specific implementations. First, the memory 32 storing the fingerprint table includes a processing unit 321. Similar to the way the backup server 21 obtains the probability value, the backup server 31 may send the fingerprint of the data block to be stored and the index fingerprints in the first fingerprint set to the memory 32; the memory 32 determines the probability value through its own processing unit 321 and the fingerprint statistical information that it stores, and then transmits the probability value to the backup server 31. Second, the processor 311 reads the statistical information of the fingerprints stored on the memory 32 into the memory 312, and the processor 311 itself determines the probability value according to the statistical information in the memory 312.
  • The backup server 31 performs step 103 in the same manner as the backup server 21 described above.
  • The backup server 31 performs step 104 to obtain the result of comparing the fingerprint of the data block to be stored against the plurality of fingerprints represented by the index fingerprints of the second fingerprint set. Similar to step 104 performed by the backup server 21, the backup server 31 may read, by the processor 311, the plurality of fingerprints represented by the index fingerprints in the second fingerprint set stored in the memory 32 into the memory 312 for fingerprint comparison; that is, the work of fingerprint comparison is done by the backup server 31 itself.
  • Another way for the backup server 31 to obtain the fingerprint comparison result is as follows: the backup server 31 sends the index fingerprints in the second fingerprint set and the fingerprint of the data block to be stored to the memory 32, and the memory 32 performs the fingerprint comparison locally through its own processing unit 321. This method reduces the amount of data that has to be moved, and because the plurality of memories 32 can perform fingerprint comparison in parallel, it can improve the efficiency of fingerprint comparison.
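  • The parallel, memory-local comparison can be imitated in software as follows. This is only a sketch: the `Memory` class is a hypothetical stand-in for a memory 32 with its processing unit 321, and a thread pool is used here merely to model several memories working at the same time.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set, Tuple


class Memory:
    """Hypothetical model of a memory 32 that compares fingerprints locally."""

    def __init__(self, table: Dict[str, Set[int]]):
        # Maps an index fingerprint to the fingerprints it represents.
        self.table = table

    def compare(self, index_fp: str, fp: int) -> bool:
        # Only the boolean result leaves the memory, not the fingerprints.
        return fp in self.table.get(index_fp, set())


def parallel_compare(requests: List[Tuple[Memory, str]], fp: int) -> bool:
    """Ask every memory in the second fingerprint set to compare fp in parallel."""
    with ThreadPoolExecutor(max_workers=len(requests) or 1) as pool:
        results = pool.map(lambda req: req[0].compare(req[1], fp), requests)
        return any(results)


# Hypothetical memories, each holding one group of fingerprints.
memory_1 = Memory({"idx1": {0xA1, 0xA2}})
memory_2 = Memory({"idx2": {0xB1}})
found = parallel_compare([(memory_1, "idx1"), (memory_2, "idx2")], 0xB1)
```

  • Only the index fingerprint and the fingerprint of the data block to be stored travel to each memory, and only a comparison result travels back, which is the reduction in data movement that the paragraph above refers to.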
  • One implementation of the fingerprint comparison in steps 103 and 104 is as follows: the value of the preset threshold in step 103 is 0, and every index fingerprint whose corresponding probability value is greater than 0 is determined to be an element of the second fingerprint set. The backup server 31 may then
  • sort the index fingerprints in the second fingerprint set according to their corresponding probability values. When the processor 311 reads the plurality of fingerprints represented by the index fingerprints in the second fingerprint set into the memory for fingerprint retrieval,
  • the order in which the fingerprints are read is determined by the ordering of the probability values of the index fingerprints.
  • That is, the plurality of fingerprints represented by the index fingerprint with the largest probability value are read first. If it cannot be determined from these fingerprints whether the data block to be stored is duplicate data, the plurality of fingerprints represented by the index fingerprint whose probability value ranks second
  • are read into the memory for fingerprint matching, and so on, until the same fingerprint as the fingerprint of the data block to be stored is retrieved, or until it is determined that none of the plurality of fingerprints represented by the index fingerprints whose probability values are greater than 0 is the same as the fingerprint of the data block to be stored.
  • The former case indicates that the data block to be stored is duplicate data,
  • and the latter case indicates that the data block to be stored is new data.
  • The fingerprint index table may further include location information of the first fingerprint table to which the first index fingerprint belongs, that is, information identifying the auxiliary storage in which the first fingerprint table is saved.
  • In this case, the processor 211 only needs to locate the auxiliary storage that stores the first fingerprint table according to the location information corresponding to the first index fingerprint, and send the first index fingerprint and the fingerprint of the data block to be stored to that auxiliary storage, so that the auxiliary storage determines the probability that the first fingerprint table includes the same fingerprint as the fingerprint of the data block to be stored.
  • Similarly, the fingerprint index table may further include location information of the first fingerprint table to which the first index fingerprint belongs, that is, an identifier of the first memory in which the first fingerprint table is saved. When the backup server 31 performs step 102,
  • the processor 311 only needs to locate the first memory according to the location information corresponding to the first index fingerprint and send the first index fingerprint and the fingerprint of the data block to be stored to the first memory, so that the first memory determines
  • the probability that the first fingerprint table includes the same fingerprint as the fingerprint of the data block to be stored.
  • the first fingerprint table is stored in a first memory of the plurality of memories
  • the second fingerprint table is stored in a second memory of the plurality of memories.
  • In step 102, obtaining, according to the first index fingerprint, a first probability that the first fingerprint table includes the same fingerprint as the fingerprint of the data block to be stored includes the following steps: sending the fingerprint of the data block to be stored and the first index fingerprint to the first memory; and receiving a first probability returned by the first memory, the first probability being used to indicate
  • the probability that the plurality of fingerprints represented by the first index fingerprint contain the same fingerprint as the fingerprint of the data block to be stored.
  • Obtaining, according to the second index fingerprint, a second probability that the second fingerprint table includes the same fingerprint as the fingerprint of the data block to be stored includes the following steps: sending the fingerprint of the data block to be stored and the second index fingerprint to the second memory; and receiving a second probability returned by the second memory, the second probability being used to indicate the probability that the plurality of fingerprints represented by the second index fingerprint contain the same fingerprint as the fingerprint of the data block to be stored.
  • The above manner corresponds to the case where the backup server 31 executes step 102 in Embodiment 2. The first memory and the second memory are two of the memories 32, and fingerprint matching can be performed by the processing unit 321 that each of them includes.
  • the specific embodiment thereof has been described in detail in Embodiment 2 and will not be repeated here.
  • the backup server includes auxiliary storage, and the first fingerprint table and the second fingerprint table are stored in the auxiliary storage.
  • In step 102, obtaining, according to the first index fingerprint, a first probability that the first fingerprint table includes the same fingerprint as the fingerprint of the data block to be stored, and obtaining, according to the second index fingerprint, a second probability that the second fingerprint table includes
  • the same fingerprint as the fingerprint of the data block to be stored, is implemented as follows:
  • sending the fingerprint of the data block to be stored, the first index fingerprint, and the second index fingerprint to the auxiliary storage; and receiving, from the auxiliary storage, the first probability that the plurality of fingerprints represented by the first index fingerprint include the same fingerprint as the fingerprint of the data block to be stored,
  • and the second probability that the plurality of fingerprints represented by the second index fingerprint include the same fingerprint as the fingerprint of the data block to be stored.
  • The above manner corresponds to the case where the backup server 21 performs step 102 in Embodiment 1. The auxiliary storage in this embodiment is the auxiliary storage 213 of Embodiment 1, which can perform fingerprint comparison through the processing unit 214 that it includes.
  • the specific embodiment has been described in detail in Embodiment 1, and will not be repeated here.
  • Each fingerprint in the first fingerprint table includes M bits, and each M-bit fingerprint includes N intervals. Each of the N intervals includes S consecutive bits of the M bits, any two of the N intervals do not overlap, and the sum of the numbers of bits in the N intervals is M. N is a natural number greater than or equal to 2,
  • and S is a natural number.
  • each fingerprint in the fingerprint table can be divided into N sections, each section corresponding to one fingerprint dimension.
  • For example, a 64-bit fingerprint can be divided into four 16-bit dimension values: bits 1 to 16 form the first dimension, bits 17 to 32 form the second dimension, bits 33 to 48 form the third dimension, and bits 49 to 64 form the fourth dimension.
  • The number of bits occupied by each dimension value is not limited, and the number of bits occupied by all dimensions together is not limited.
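  • The division of a fingerprint into dimensions can be illustrated with the short sketch below. It assumes equal-width dimensions and assumes that "bit 1" denotes the most significant bit of the fingerprint; neither assumption is stated in the embodiment, and the function name `split_fingerprint` is hypothetical.

```python
from typing import List


def split_fingerprint(fp: int, total_bits: int = 64, n: int = 4) -> List[int]:
    """Split an M-bit fingerprint into N equal-width dimension values.

    Dimension 1 is taken from the most significant bits (an assumption).
    """
    if total_bits % n != 0:
        raise ValueError("total_bits must be divisible by n for equal-width dimensions")
    width = total_bits // n          # S bits per dimension
    mask = (1 << width) - 1
    return [(fp >> (total_bits - width * (i + 1))) & mask for i in range(n)]


# A 64-bit fingerprint split into four 16-bit dimension values.
dims_64 = split_fingerprint(0x0001000200050004)
# A 32-bit fingerprint split into four 8-bit dimension values.
dims_32 = split_fingerprint(0x01020504, total_bits=32, n=4)
```

  • Both calls yield the dimension values 1, 2, 5, and 4, the second one using the 32-bit fingerprint format that appears in the examples of this embodiment.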
  • the storage system stores a first statistical table, where the first statistical table includes statistical information of values of the plurality of fingerprints represented by the first index fingerprint in the N intervals.
  • FIG. 8 is a schematic diagram of a first statistical table. It may be assumed that the three fingerprints represented by the first index fingerprint are 01020504H, 01030504H, and 02030102H, and that each fingerprint is divided into 4 dimensions. Taking 01020504H as an example, its values in the four fingerprint dimensions are 1, 2, 5, and 4, respectively.
  • The first statistical table records, for each dimension, the frequency with which each possible value appears in that dimension of the plurality of fingerprints represented by the first index fingerprint. For example, the value "1" appears in the first dimension with a frequency of 2, the value "2"
  • appears in the first dimension with a frequency of 1, and the values "3", "4", and "5" each appear in the first dimension with a frequency of 0.
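  • A statistical table of this kind can be built with per-dimension counters, as in the following sketch. It assumes 32-bit fingerprints split into four 8-bit dimensions with the first dimension in the most significant byte; the helper names `dimension_values` and `build_stat_table` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def dimension_values(fp: int, total_bits: int = 32, n: int = 4) -> List[int]:
    """Split a fingerprint into n equal-width dimension values (assumed layout)."""
    width = total_bits // n
    mask = (1 << width) - 1
    return [(fp >> (total_bits - width * (i + 1))) & mask for i in range(n)]


def build_stat_table(fingerprints: Iterable[int], total_bits: int = 32, n: int = 4) -> List[Counter]:
    """Count, per dimension, how often each value occurs among the fingerprints."""
    table = [Counter() for _ in range(n)]
    for fp in fingerprints:
        for i, value in enumerate(dimension_values(fp, total_bits, n)):
            table[i][value] += 1
    return table


# The three fingerprints represented by the first index fingerprint.
stat_table = build_stat_table([0x01020504, 0x01030504, 0x02030102])
# stat_table[0] holds the first-dimension counts: value 1 twice, value 2 once.
```

  • A `Counter` returns 0 for values that never occur, which matches the frequency of 0 recorded for the values "3", "4", and "5" in the first dimension.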
  • The manner of determining the first probability includes: determining, according to the first statistical table, the frequency t_i with which the value a_i appears in the i-th interval of the plurality of fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N; and determining the first probability according to the minimum of the obtained values t_1 to t_N.
  • the first statistical table may be stored on the auxiliary storage 213 holding the first fingerprint table, and the first probability is determined by the auxiliary storage 213 through its own processing unit 214.
  • For example, it may be assumed that the fingerprint of the data block to be stored is 01020404H, whose values in the four fingerprint dimensions are 1, 2, 4, and 4, respectively.
  • The probability index of retrieving this fingerprint in the table shown in FIG. 5 is calculated as follows: the frequency of the value 1 in the first dimension is 2, the frequency of the value 2 in the second dimension is 1, the frequency of the value 4 in the third dimension is 0, and the frequency of the value 4 in the fourth dimension is 2. The probability index is the minimum of these frequencies, namely 0.
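  • The calculation above can be reproduced with the following sketch, which again assumes 32-bit fingerprints split into four 8-bit dimensions with the first dimension in the most significant byte. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def dimension_values(fp: int, total_bits: int = 32, n: int = 4) -> List[int]:
    """Split a fingerprint into n equal-width dimension values (assumed layout)."""
    width = total_bits // n
    mask = (1 << width) - 1
    return [(fp >> (total_bits - width * (i + 1))) & mask for i in range(n)]


def build_stat_table(fingerprints: Iterable[int]) -> List[Counter]:
    """Per-dimension frequency counts for the fingerprints of one index fingerprint."""
    table = [Counter() for _ in range(4)]
    for fp in fingerprints:
        for i, value in enumerate(dimension_values(fp)):
            table[i][value] += 1
    return table


def probability_index(stat_table: List[Counter], fp: int) -> int:
    """Minimum over all dimensions of the frequency t_i of the value a_i."""
    return min(stat_table[i][a_i] for i, a_i in enumerate(dimension_values(fp)))


stat_table = build_stat_table([0x01020504, 0x01030504, 0x02030102])
index_absent = probability_index(stat_table, 0x01020404)   # frequencies 2, 1, 0, 2
index_present = probability_index(stat_table, 0x01020504)  # frequencies 2, 1, 2, 2
index_mixed = probability_index(stat_table, 0x01030102)    # frequencies 2, 2, 1, 1
```

  • A probability index of 0 proves that the fingerprint is not among the represented fingerprints, because at least one of its dimension values never occurs. A non-zero index does not prove presence: 01030102H yields an index of 1 although it is not one of the three stored fingerprints, so a fingerprint comparison is still needed in that case.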
  • The first statistical table may also be stored in the memory 32 that stores the first fingerprint table, in which case the first probability is determined by the memory 32 through its processing unit 321. The specific determination method is as described above,
  • that is, the same as the manner in which the processing unit 214 determines the first probability according to the first statistical table.
  • the second probability may also be determined in the above manner by using a statistical table of fingerprints, and the embodiment of the present invention is not repeated here.
  • Determining, by means of the first statistical table, the first probability of retrieving the fingerprint of the data block to be stored among the plurality of fingerprints represented by the first index fingerprint is simple to implement, requires little calculation, takes little time, and gives an accurate result.
  • FIG. 9 is a schematic diagram of another first statistical table, where the first statistical table includes statistical information of the values of a first interval and of a second interval of the plurality of fingerprints represented by the first index fingerprint.
  • The manner of determining the first probability includes: determining, according to the first statistical table, the frequency t_1 with which the value a appears in the first interval of the plurality of fingerprints represented by the first index fingerprint and the frequency t_2 with which the value b appears in the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of the j-th to k-th bits of the fingerprint of the data block to be stored; and determining the first probability according to the minimum of t_1 and t_2.
  • In this case the first statistical table includes statistical information for only some of the dimensions of the fingerprints. For example, the first index fingerprint represents the three fingerprints 01020504H, 01030504H, and 02030102H, and the first statistical table may save only the statistical information of the first dimension and the third dimension. It may be assumed that the fingerprint of the data block to be stored is 01020404H, whose value in the third dimension is "4". According to this first statistical table,
  • the frequency of occurrence of "4" in the third dimension can be determined to be 0, which indicates that the fingerprint of the data block to be stored is not among the plurality of fingerprints represented by the first index fingerprint.
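  • A statistical table restricted to two bit intervals can be sketched as follows. The sketch assumes 32-bit fingerprints whose bits are numbered from 1 at the most significant end, so that the first dimension is bits 1 to 8 and the third dimension is bits 17 to 24. The names `bit_field`, `build_partial_table`, and `partial_probability_index` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # (first bit, last bit), 1-based and inclusive


def bit_field(fp: int, start: int, end: int, total_bits: int = 32) -> int:
    """Value of bits start..end of fp, counting bit 1 as the most significant bit."""
    width = end - start + 1
    return (fp >> (total_bits - end)) & ((1 << width) - 1)


def build_partial_table(fingerprints: Iterable[int], intervals: List[Interval]) -> Dict[Interval, Counter]:
    """Frequency counts for the selected intervals only."""
    table = {interval: Counter() for interval in intervals}
    for fp in fingerprints:
        for start, end in intervals:
            table[(start, end)][bit_field(fp, start, end)] += 1
    return table


def partial_probability_index(table: Dict[Interval, Counter], fp: int) -> int:
    """Minimum of the frequencies t_1 and t_2 over the recorded intervals."""
    return min(counts[bit_field(fp, start, end)] for (start, end), counts in table.items())


# Keep statistics only for the first dimension (bits 1-8) and the third (bits 17-24).
intervals = [(1, 8), (17, 24)]
partial_table = build_partial_table([0x01020504, 0x01030504, 0x02030102], intervals)
index_for_new_block = partial_probability_index(partial_table, 0x01020404)
```

  • For 01020404H the first interval has the value 1, which occurs twice, and the second interval has the value 4, which never occurs, so the minimum is 0 and the index fingerprint can be removed from the first fingerprint set without reading any fingerprints. Keeping fewer intervals saves space at the cost of fewer exclusions.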
  • The first statistical table in this embodiment may be stored in the auxiliary storage 213 of the storage system 20, and the first probability is determined by the auxiliary storage 213 through its own processing unit 214.
  • the first statistical table may also be stored in the memory 32 in the storage system 30, and the first probability is determined by the processing unit 321 of the memory 32 itself.
  • The first probability of retrieving the fingerprint of the data block to be stored among the plurality of fingerprints represented by the first index fingerprint is determined from the statistical information of only some dimensions of the fingerprints, so that the index fingerprints whose corresponding
  • probability value is 0 can be removed from the first fingerprint set. This reduces the amount of data moved during fingerprint comparison, and the method is simple to implement, requires little calculation, and takes little time.
  • the backup server 21 or the backup server 31 determines that the data block to be stored is new data.
  • Before performing step 101, the backup server may further perform the following step: performing fingerprint filtering on the fingerprint of the data block to be stored, and determining through fingerprint filtering whether the fingerprint of the data block to be stored is a duplicate fingerprint.
  • The fingerprint may be pre-judged by using a fingerprint filtering technology, and the pre-judgment has three possible results. First, it is determined that the fingerprint table contains the fingerprint of the data block to be stored, so the data block to be stored is duplicate data. Second, it is determined that the fingerprint table does not contain the fingerprint of the data block to be stored, so the data block to be stored is new data. Third, it cannot be asserted whether the fingerprint table contains the fingerprint of the data block to be stored; only in this case does the backup server perform steps 101 to 104.
  • a fingerprint filtering technology such as a Bloom Filter or a Locality Preserved Caching (LPC) technology may be used, or a combination of two or more fingerprint filtering technologies may be used.
  • For example, the Bloom filter is first used to filter the fingerprint. If the Bloom filter cannot determine whether the fingerprint of the data block to be stored is a duplicate fingerprint, the LPC technology is further used for filtering.
  • For the specific implementation of the fingerprint filtering technology, refer to the prior art; it is not described in detail in the embodiments of the present invention.
  • The backup server first pre-judges the fingerprint of the data block to be stored by using the fingerprint filtering technology, and performs steps 101 to 104 only if the pre-judgment cannot determine whether the fingerprint is a duplicate.
  • The fingerprint filtering technology can greatly shorten the time consumed in comparing some fingerprints and improve the performance of the backup server.
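  • The three-way pre-judgment can be sketched as follows. This is an illustration under stated assumptions rather than the filtering used by the embodiment: the Bloom filter is a minimal one built on SHA-256, its size and hash count are arbitrary, and a plain in-memory set stands in for the LPC technology, whose details the embodiment leaves to the prior art. All class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; size and hash construction are illustrative choices."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, fp: bytes):
        # Derive several bit positions by salting SHA-256 with a counter byte.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(bytes([salt]) + fp).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, fp: bytes) -> None:
        for pos in self._positions(fp):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fp: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(fp))


def prejudge(fp: bytes, bloom: BloomFilter, cache: set) -> str:
    """Return "new", "duplicate", or "undetermined" for a fingerprint."""
    if not bloom.might_contain(fp):
        return "new"            # the filter proves the fingerprint was never added
    if fp in cache:
        return "duplicate"      # found in the cached fingerprints (stand-in for LPC)
    return "undetermined"       # fall through to steps 101 to 104


bloom = BloomFilter()
cache = set()
fp = bytes.fromhex("01020504")
before_add = prejudge(fp, bloom, cache)   # empty filter: definitely new
bloom.add(fp)
after_bloom = prejudge(fp, bloom, cache)  # possibly present, but not cached
cache.add(fp)
after_cache = prejudge(fp, bloom, cache)  # cached: known duplicate
```

  • A Bloom filter never reports a stored fingerprint as absent, but it can report an absent fingerprint as possibly present, which is why the "undetermined" outcome still requires steps 101 to 104.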
  • The fingerprint table may be maintained in the following manner: a new fingerprint table is created in the memory 212 in real time; when it is determined that the currently stored fingerprint tables do not include the fingerprint of the data block to be stored,
  • the data block is stored as new data, and its fingerprint is added to the fingerprint table created in the memory in real time.
  • The fingerprint table is then stored on the auxiliary storage 213,
  • each index fingerprint corresponding to the fingerprint table is added to the fingerprint index table,
  • and the statistical table of the dimension values of the plurality of fingerprints corresponding to each index fingerprint of the fingerprint table is stored in the auxiliary storage 213.
  • When the fingerprint table is maintained in the above manner, the backup server, when performing fingerprint comparison, first compares against the fingerprints in the newly created fingerprint table. If the comparison result cannot determine whether the fingerprint of the data block to be stored is a duplicate fingerprint, steps 101 to 104 are then performed to carry out fingerprint comparison in the fingerprint tables stored outside the memory.
  • The fingerprint table may also be maintained as follows: when it is determined that the data block to be stored is new data, the fingerprint of the data block is added to the fingerprint table of the memory 32 that stores the data block, and the statistical table of that fingerprint table is then updated.
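  • Keeping the statistical table consistent with the fingerprint table can be sketched as below. The class `FingerprintTable` and its method `add_new` are hypothetical, and the sketch assumes 32-bit fingerprints with four 8-bit dimensions and a single group of fingerprints per table.

```python
from collections import Counter
from typing import List


class FingerprintTable:
    """Hypothetical fingerprint table that updates its statistics on every insert."""

    def __init__(self, total_bits: int = 32, n: int = 4):
        self.total_bits = total_bits
        self.n = n
        self.width = total_bits // n
        self.fingerprints = set()
        self.stats = [Counter() for _ in range(n)]  # one counter per dimension

    def _dimension_values(self, fp: int) -> List[int]:
        mask = (1 << self.width) - 1
        return [(fp >> (self.total_bits - self.width * (i + 1))) & mask
                for i in range(self.n)]

    def add_new(self, fp: int) -> bool:
        """Add the fingerprint of a new data block; return False if already present."""
        if fp in self.fingerprints:
            return False
        self.fingerprints.add(fp)
        # Update the statistical table in the same operation as the insert.
        for i, value in enumerate(self._dimension_values(fp)):
            self.stats[i][value] += 1
        return True


table = FingerprintTable()
for new_fp in (0x01020504, 0x01030504, 0x02030102):
    table.add_new(new_fp)
repeat_added = table.add_new(0x01020504)  # already present, statistics unchanged
```

  • Updating the counters together with the insert keeps a frequency of 0 meaningful: it continues to guarantee that no stored fingerprint has that value in that dimension.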
  • The processor 211, the processor 311, the processing unit 214, and the processing unit 321 may each be an independent processor or a collective name for multiple processing elements.
  • For example, the processor 211, the processor 311, the processing unit 214, and the processing unit 321 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, such as one or more digital signal processors (DSP) or one or more field programmable gate arrays (FPGA).
  • An embodiment of the present invention provides a backup server 40, which is applied to a storage system, where the storage system includes the backup server and a plurality of memories.
  • The storage system stores multiple fingerprint tables, and the multiple fingerprint tables record the fingerprints of the data blocks that have been stored in the plurality of memories.
  • FIG. 10 is a schematic block diagram showing the structure of the backup server 40.
  • the backup server 40 includes:
  • the determining module 41 is configured to determine a first fingerprint set according to the index fingerprints in the fingerprint index table and the fingerprint of the data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint
  • is used to represent multiple fingerprints in the first fingerprint table, the second index fingerprint is used to represent multiple fingerprints in the second fingerprint table, and the fingerprint of the data block to be stored belongs both to the fingerprint range of the multiple fingerprints represented by the first index fingerprint and to the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
  • the obtaining module 42 is configured to obtain, according to the first index fingerprint, a first probability that the first fingerprint table includes the same fingerprint as the fingerprint of the data block to be stored, and to obtain, according to the second index fingerprint, a second probability that the second fingerprint table includes the same fingerprint as the fingerprint of the data block to be stored, where the first probability is determined according to the plurality of fingerprints represented by the first index fingerprint, and the second probability is determined according to the plurality of fingerprints represented by the second index fingerprint;
  • the determining module 41 is further configured to determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold;
  • the processing module 43 is configured to obtain a matching result between the plurality of fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored.
  • the first fingerprint table is stored in a first memory of the plurality of memories
  • the second fingerprint table is stored in the second memory of the plurality of memories
  • the obtaining module 42 is specifically configured to: send the fingerprint of the data block to be stored and the first index fingerprint to the first memory, and receive the first probability returned by the first memory, where the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored;
  • and send the fingerprint of the data block to be stored and the second index fingerprint to the second memory, and receive the second probability returned by the second memory, where the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.
  • the backup server 40 further includes:
  • an auxiliary storage, configured to store the first fingerprint table and the second fingerprint table;
  • the obtaining module 42 is specifically configured to: send the fingerprint of the data block to be stored, the first index fingerprint, and the second index fingerprint to the auxiliary storage; and receive, from the auxiliary storage, the first probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and the second probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.
  • the auxiliary storage is further configured to store a first statistical table, where the first statistical table includes statistical information about the values of a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information about the values of a second interval of those fingerprints; the first interval is the interval from the h-th bit to the i-th bit of each fingerprint, and the second interval is the interval from the j-th bit to the k-th bit of each fingerprint;
  • h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
  • the auxiliary storage is further configured to: determine, according to the first statistical table, the frequency t_1 with which the value a occurs among the values of the first interval of the multiple fingerprints represented by the first index fingerprint, and the frequency t_2 with which the value b occurs among the values of the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of its j-th to k-th bits; and determine the first probability according to the minimum of t_1 and t_2.
  • each fingerprint in the first fingerprint table includes M bits, and each M-bit fingerprint includes N intervals; each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit counts of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number;
  • the auxiliary storage is further configured to store a first statistical table, where the first statistical table includes statistical information about the values of the N intervals of the multiple fingerprints represented by the first index fingerprint;
  • the auxiliary storage is further configured to: determine, according to the first statistical table, the frequency t_i with which a_i occurs among the values of the i-th interval of the multiple fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N; and determine the first probability according to the minimum of t_1 to t_N.
  • the backup server 40 in this embodiment and the data processing method corresponding to FIG. 3 are two aspects of the same inventive concept. The implementation of the method has been described in detail above, so a person skilled in the art can clearly understand the structure and implementation of the backup server 40 from that description. For brevity, details are not repeated here.
  • based on the same inventive concept, an embodiment of the present invention provides a storage system that includes a backup server and multiple memories; the storage system stores multiple fingerprint tables, and the multiple fingerprint tables record the fingerprints of the data blocks already stored in the multiple memories.
  • This backup server is used to:
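  • determine a first fingerprint set according to the index fingerprints in the fingerprint index table and the fingerprint of the data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in the first fingerprint table, the second index fingerprint represents multiple fingerprints in the second fingerprint table, and the fingerprint of the data block to be stored falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
  • obtain, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and obtain, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
  • determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold;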
  • a matching result of the plurality of fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored is obtained.
  • the first fingerprint table is stored in the first memory of the plurality of memories
  • the second fingerprint table is stored in the second memory of the plurality of memories
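  • the backup server is specifically configured to: send the fingerprint of the data block to be stored and the first index fingerprint to the first memory, and receive the first probability returned by the first memory; and send the fingerprint of the data block to be stored and the second index fingerprint to the second memory, and receive the second probability returned by the second memory, where each probability indicates the probability that the multiple fingerprints represented by the corresponding index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored;
  • the first memory is specifically configured to: receive the first index fingerprint and the fingerprint of the data block to be stored sent by the backup server, determine the first probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and send the first probability to the backup server;
  • the second memory is specifically configured to: receive the second index fingerprint and the fingerprint of the data block to be stored sent by the backup server, determine the second probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and send the second probability to the backup server.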
  • the first memory stores a first statistical table, where the first statistical table includes statistical information about the values of a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information about the values of a second interval of those fingerprints; the first interval is the interval from the h-th bit to the i-th bit of each fingerprint, and the second interval is the interval from the j-th bit to the k-th bit of each fingerprint;
  • h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
  • the first memory is specifically configured to:
  • determine, according to the first statistical table, the frequency t_1 with which the value a occurs among the values of the first interval of the multiple fingerprints represented by the first index fingerprint, and the frequency t_2 with which the value b occurs among the values of the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of its j-th to k-th bits;
  • determine the first probability according to the minimum of t_1 and t_2.
  • each fingerprint in the first fingerprint table includes M bits, and each M-bit fingerprint includes N intervals; each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit counts of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number;
  • a first statistical table is stored in the first memory, and the first statistical table includes statistical information about the values of the N intervals of the multiple fingerprints represented by the first index fingerprint;
  • the first memory is specifically configured to: determine, according to the first statistical table, the frequency t_i with which a_i occurs among the values of the i-th interval of the multiple fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N;
  • determine the first probability according to the minimum of t_1 to t_N.
  • the storage system in this embodiment and the data processing method corresponding to FIG. 3 are two aspects of the same inventive concept. The implementation of the method has been described in detail above, so a person skilled in the art can clearly understand the structure and implementation of the storage system from that description. For brevity, details are not repeated here.
  • the backup server uses the fingerprint index table to determine, in each fingerprint table, the multiple fingerprints represented by one index fingerprint whose fingerprint range includes the fingerprint of the data block to be stored, and performs the subsequent fingerprint comparison only on the fingerprints so determined, which reduces the amount of fingerprint data to be moved.
  • moreover, for each fingerprint table, the probability of finding the fingerprint of the data block to be stored among the determined fingerprints is calculated, and fingerprint comparison is performed only among the determined fingerprints of the fingerprint tables whose probability is not less than the preset threshold.
  • this reduces the amount of fingerprint data moved during comparison and the time spent on comparison, that is, the time needed to determine whether the data block to be stored is a duplicate data block, and improves the efficiency of storing data.
  • the embodiment of the invention further provides a computer program product for data processing, comprising a computer readable storage medium storing program code, the program code comprising instructions for executing the method flow described in any one of the foregoing method embodiments.
  • a person of ordinary skill in the art will understand that the foregoing storage medium includes any non-transitory machine-readable medium that can store program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random-access memory (RAM), a solid state disk (SSD), or a non-volatile memory.

Abstract

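A data processing method, a backup server (11, 21, 31, 40), and a storage system (10, 20, 30), used to solve the problem that fingerprint comparison consumes a large amount of I/O resources and lowers data storage efficiency. The data processing method includes: determining a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of a data block to be stored (101); obtaining, according to the first index fingerprint, a first probability that a first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and obtaining, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored (102); determining a second fingerprint set according to the first probability and the second probability (103); and obtaining a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored (104).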

Description

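Data processing method, backup server, and storage system
Technical Field
The present invention relates to the field of computer technologies, and in particular to a data processing method, a backup server, and a storage system.
Background
In the field of data storage, data deduplication is a key technology for saving storage space. It detects and eliminates data redundancy, keeping only one copy of identical data. This saves considerable disk space, improves write performance, and saves network bandwidth, and the technology is widely used in file backup, online storage services, email services, and similar fields.
In the prior art, a storage system stores a fingerprint table that holds the fingerprints of the data already stored in the storage system. On receiving a data storage request, the storage system compares the fingerprint of the data block to be stored with the fingerprints in the fingerprint table to determine whether the data block is duplicate data, and thereby determines how to store it.
However, to compare the fingerprint of the data block to be stored with the fingerprints in the fingerprint table, the fingerprints in the table must be read into the main memory of the backup server. Because the fingerprint table holds a massive number of fingerprints, comparing against all fingerprint tables moves a large amount of data, consumes a large amount of input/output (I/O) resources, and takes a long time, so the storage system stores data inefficiently.
Summary
Embodiments of the present invention provide a data processing method, a backup server, and a storage system, to solve the problem that fingerprint comparison consumes a large amount of I/O resources and lowers data storage efficiency.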
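According to a first aspect, an embodiment of the present invention provides a data processing method. The method is performed by a backup server in a storage system, the storage system includes the backup server and multiple memories, the storage system stores multiple fingerprint tables, the multiple fingerprint tables record the fingerprints of the data blocks already stored in the multiple memories, and the method includes:
determining a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of a data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the data block to be stored falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
obtaining, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and obtaining, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
determining a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold; and
obtaining a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored.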
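With reference to the first aspect, in a first possible implementation of the first aspect, the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories; and the obtaining, according to the first index fingerprint, of the first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored includes:
sending the fingerprint of the data block to be stored and the first index fingerprint to the first memory; and
receiving the first probability returned by the first memory, where the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored;
and the obtaining, according to the second index fingerprint, of the second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored includes:
sending the fingerprint of the data block to be stored and the second index fingerprint to the second memory; and
receiving the second probability returned by the second memory, where the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.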
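With reference to the first aspect, in a second possible implementation of the first aspect, the backup server includes an auxiliary storage, and the first fingerprint table and the second fingerprint table are stored in the auxiliary storage;
the obtaining, according to the first index fingerprint, of the first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and the obtaining, according to the second index fingerprint, of the second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, include:
sending the fingerprint of the data block to be stored, the first index fingerprint, and the second index fingerprint to the auxiliary storage; and
receiving, from the auxiliary storage, the first probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and the second probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.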
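With reference to any one of the first aspect, the first possible implementation of the first aspect, and the second possible implementation of the first aspect, in a third possible implementation of the first aspect, each fingerprint in the first fingerprint table contains M bits, and each M-bit fingerprint contains N intervals; each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit counts of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number;
the storage system stores a first statistical table, the first statistical table contains statistical information about the values, in the N intervals, of the multiple fingerprints represented by the first index fingerprint, and the first probability is determined by:
determining, according to the first statistical table, the frequency t_i with which a_i occurs among the values of the i-th interval of the multiple fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N; and
determining the first probability according to the minimum of the obtained t_1 to t_N.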
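With reference to any one of the first aspect, the first possible implementation of the first aspect, and the second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the storage system stores a first statistical table; the first statistical table contains statistical information about the values of a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information about the values of a second interval of those fingerprints; the first interval is the interval from the h-th bit to the i-th bit of each fingerprint, and the second interval is the interval from the j-th bit to the k-th bit of each fingerprint, where h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap; and the first probability is determined by:
determining, according to the first statistical table, the frequency t_1 with which the value a occurs among the values of the first interval of the multiple fingerprints represented by the first index fingerprint, and the frequency t_2 with which the value b occurs among the values of the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of its j-th to k-th bits; and
determining the first probability according to the minimum of t_1 and t_2.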
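According to a second aspect, an embodiment of the present invention provides a backup server applied in a storage system. The storage system includes the backup server and multiple memories, the storage system stores multiple fingerprint tables, the multiple fingerprint tables record the fingerprints of the data blocks already stored in the multiple memories, and the backup server includes:
a determining module, configured to determine a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of a data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the data block to be stored falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
an obtaining module, configured to obtain, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and to obtain, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
the determining module being further configured to determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold; and
a processing module, configured to obtain a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored.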
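With reference to the second aspect, in a first possible implementation of the second aspect, the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories;
the obtaining module is specifically configured to: send the fingerprint of the data block to be stored and the first index fingerprint to the first memory, and receive the first probability returned by the first memory, where the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored; and
send the fingerprint of the data block to be stored and the second index fingerprint to the second memory, and receive the second probability returned by the second memory, where the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.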
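With reference to the second aspect, in a second possible implementation of the second aspect, the backup server further includes:
an auxiliary storage, configured to store the first fingerprint table and the second fingerprint table;
the obtaining module is specifically configured to: send the fingerprint of the data block to be stored, the first index fingerprint, and the second index fingerprint to the auxiliary storage; and receive, from the auxiliary storage, the first probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and the second probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.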
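With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, each fingerprint in the first fingerprint table contains M bits, and each M-bit fingerprint contains N intervals; each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit counts of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number; the auxiliary storage is further configured to store a first statistical table, and the first statistical table contains statistical information about the values of the N intervals of the multiple fingerprints represented by the first index fingerprint;
the auxiliary storage is further configured to: determine, according to the first statistical table, the frequency t_i with which a_i occurs among the values of the i-th interval of the multiple fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N; and determine the first probability according to the minimum of t_1 to t_N.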
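With reference to the second possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the auxiliary storage is further configured to store a first statistical table; the first statistical table contains statistical information about the values of a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information about the values of a second interval of those fingerprints; the first interval is the interval from the h-th bit to the i-th bit of each fingerprint, and the second interval is the interval from the j-th bit to the k-th bit of each fingerprint, where h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
the auxiliary storage is further configured to: determine, according to the first statistical table, the frequency t_1 with which the value a occurs among the values of the first interval of the multiple fingerprints represented by the first index fingerprint, and the frequency t_2 with which the value b occurs among the values of the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of its j-th to k-th bits; and determine the first probability according to the minimum of t_1 and t_2.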
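According to a third aspect, an embodiment of the present invention provides a storage system including a backup server and multiple memories, where the storage system stores multiple fingerprint tables, and the multiple fingerprint tables record the fingerprints of the data blocks already stored in the multiple memories;
the backup server is configured to:
determine a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of a data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the data block to be stored falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
obtain, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and obtain, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold; and
obtain a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored.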
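With reference to the third aspect, in a first possible implementation of the third aspect, the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories; the backup server is specifically configured to:
send the fingerprint of the data block to be stored and the first index fingerprint to the first memory, and receive the first probability returned by the first memory, where the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored; and
send the fingerprint of the data block to be stored and the second index fingerprint to the second memory, and receive the second probability returned by the second memory, where the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored;
the first memory is specifically configured to: receive the first index fingerprint and the fingerprint of the data block to be stored sent by the backup server, determine the first probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and send the first probability to the backup server;
the second memory is specifically configured to: receive the second index fingerprint and the fingerprint of the data block to be stored sent by the backup server, determine the second probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and send the second probability to the backup server.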
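With reference to the third aspect or the first possible implementation of the third aspect, in a second possible implementation of the third aspect, each fingerprint in the first fingerprint table contains M bits, and each M-bit fingerprint contains N intervals; each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit counts of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number; a first statistical table is stored in the first memory, and the first statistical table contains statistical information about the values of the N intervals of the multiple fingerprints represented by the first index fingerprint;
the first memory is specifically configured to: determine, according to the first statistical table, the frequency t_i with which a_i occurs among the values of the i-th interval of the multiple fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N; and
determine the first probability according to the minimum of t_1 to t_N.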
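With reference to the third aspect or the first possible implementation of the third aspect, in a third possible implementation of the third aspect, a first statistical table is stored in the first memory; the first statistical table contains statistical information about the values of a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information about the values of a second interval of those fingerprints; the first interval is the interval from the h-th bit to the i-th bit of each fingerprint, and the second interval is the interval from the j-th bit to the k-th bit of each fingerprint, where h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
the first memory is specifically configured to:
determine, according to the first statistical table, the frequency t_1 with which the value a occurs among the values of the first interval of the multiple fingerprints represented by the first index fingerprint, and the frequency t_2 with which the value b occurs among the values of the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of its j-th to k-th bits; and
determine the first probability according to the minimum of t_1 and t_2.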
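In the embodiments of the present invention, the backup server first determines, according to the index fingerprints in the fingerprint index table and the fingerprint of the data block to be stored, a first fingerprint set that may contain the fingerprint of the data block to be stored. The backup server then obtains, according to the first index fingerprint in the first fingerprint set, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and obtains, according to the second index fingerprint in the first fingerprint set, a second probability that the second fingerprint table contains such a fingerprint. It then determines a second fingerprint set from those of the obtained probabilities that are greater than a preset threshold, and matches the multiple fingerprints represented by the index fingerprints in the second fingerprint set against the fingerprint of the data block to be stored to obtain a matching result. With this data processing method, the fingerprint of the data block to be stored needs to be matched only against the fingerprints represented by the index fingerprints in the second fingerprint set, rather than against the fingerprints of all data in the fingerprint library, which reduces the amount of data moved during fingerprint comparison and improves data processing efficiency.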
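Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings used in describing the embodiments. Evidently, the drawings described below show only some embodiments of the present invention.
FIG. 1 is a schematic structural diagram of a storage system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a fingerprint table according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic structural block diagram of a storage system according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a fingerprint index table according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the index fingerprints of a fingerprint table according to an embodiment of the present invention;
FIG. 7 is a schematic structural block diagram of another storage system according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an implementation of a statistical table according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another implementation of a statistical table according to an embodiment of the present invention;
FIG. 10 is a schematic structural block diagram of a backup server according to an embodiment of the present invention.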
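Detailed Description
To address the problem that a storage system consumes a large amount of I/O resources during fingerprint comparison and therefore stores data inefficiently, the embodiments of the present invention provide a data processing method, a backup server, and a storage system. The technical solutions of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments and the specific features in them are detailed descriptions of the technical solutions of the present invention rather than limitations on them, and that, where no conflict arises, the embodiments and their technical features may be combined with one another.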
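To make the technical solutions provided in the embodiments easier to understand, an application scenario is introduced first. FIG. 1 is a schematic structural diagram of a storage system according to an embodiment of the present invention. The storage system 10 includes a backup server 11 and multiple memories 12. The memories 12 store data; the backup server 11 determines whether a data block to be stored is duplicate data and schedules the memories 12 to store the data block. The storage system 10 stores a fingerprint (FP) table. FIG. 2 is a schematic diagram of the fingerprint table, which holds the fingerprints of the data stored in the memories 12 and the storage locations of that data.
After the storage system 10 receives a request to store data, the backup server 11 compares the fingerprint of the data block to be stored with the fingerprints in the fingerprint table. If a fingerprint identical to the fingerprint of the data block to be stored is found in the fingerprint table, the data block is duplicate data: the storage system does not store it again and only updates its reference. Otherwise, if no identical fingerprint is found, the data block is new data, and the storage system allocates storage space to store it.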
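The baseline flow just described (look up the fingerprint, then either update a reference or store the block) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `DedupStore`, the in-memory dictionary, and the use of SHA-1 as the fingerprint function are all hypothetical choices made for the example.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store (illustrative names, not from the patent)."""

    def __init__(self):
        # fingerprint -> [storage location, reference count]
        self.fingerprint_table = {}
        self.blocks = []

    def write(self, block: bytes) -> bool:
        """Store a block; return True if it turned out to be a duplicate."""
        fp = hashlib.sha1(block).hexdigest()  # assumed fingerprint function
        entry = self.fingerprint_table.get(fp)
        if entry is not None:
            entry[1] += 1           # duplicate: only update the reference
            return True
        self.blocks.append(block)   # new data: allocate space and record it
        self.fingerprint_table[fp] = [len(self.blocks) - 1, 1]
        return False
```

Every lookup here touches one flat table; the rest of the description is about avoiding exactly that when the table is too large to keep in main memory.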
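The technical solution for fingerprint comparison provided in the embodiments of the present invention is described below with reference to the accompanying drawings.
FIG. 3 is a schematic flowchart of the data processing method according to an embodiment of the present invention. The method may include:
Step 101: Determine a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of the data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the data block to be stored falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint.
Step 102: Obtain, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and obtain, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint.
Step 103: Determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold.
Step 104: Obtain a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored.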
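In the embodiments of the present invention, the data processing method of steps 101 to 104 has at least two implementations, which are described separately below.
Implementation 1
FIG. 4 is a schematic structural diagram of the storage system 20 corresponding to Implementation 1. The storage system 20 includes a backup server 21 and multiple memories 22. In Implementation 1, steps 101 to 104 are performed by the backup server 21.
The backup server 21 includes a processor 211, a main memory 212, and an auxiliary storage 213. The auxiliary storage 213 includes a processing unit 214 and therefore has computing capability. The fingerprint tables of the storage system are stored in the auxiliary storage 213; in practice there may be one auxiliary storage 213, or two or more. The backup server 21 stores a fingerprint index table, which may be kept in the main memory 212, in the auxiliary storage 213, or in another storage unit of the backup server 21.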
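FIG. 5 is a schematic diagram of the fingerprint index table. The fingerprint index table holds at least two index fingerprints for each fingerprint table. Each index fingerprint represents multiple fingerprints in one fingerprint table, and the fingerprints represented by all index fingerprints of a fingerprint table together make up all fingerprints in that table. In FIG. 5, Ti denotes the i-th fingerprint table and Li denotes the i-th index fingerprint of a fingerprint table. For example, FP14 is the second index fingerprint of the first fingerprint table; it represents the fingerprints of the first fingerprint table from FP14 up to but not including FP17, namely FP14, FP15, and FP16.
In practice, the fingerprints in each fingerprint table may be arranged by fingerprint value. FIG. 6 is a schematic diagram of the index fingerprints of fingerprint table 1. The fingerprints FP1 to FP9 in FIG. 6 increase in order, so the fingerprints of fingerprint table 1 can be divided into three parts, FP1 to FP3, FP4 to FP6, and FP7 to FP9, with FP1, FP4, and FP7 serving as the three index fingerprints of the table. In the fingerprint index table, FP1 represents FP1 to FP3 of the fingerprint table, FP4 represents FP4 to FP6, and FP7 represents FP7 to FP9. Because FP1 to FP9 increase in order, the fingerprint ranges represented by FP1, FP4, and FP7 do not overlap.
In the embodiments of the present invention, the fingerprint range of the multiple fingerprints represented by each index fingerprint can be determined from the fingerprint index table in either of two ways. In the first, the first of a run of consecutive fingerprints in a fingerprint table serves as the index fingerprint of that run, and the range represented by an index fingerprint is determined from two adjacent index fingerprints of the table: for fingerprint table 1 of FIG. 6, the adjacent index fingerprints FP1 and FP4 give the range [FP1, FP4) for FP1. In the second, the fingerprint index table includes a fingerprint-range attribute and stores, for each index fingerprint, the fingerprint range of the fingerprints it represents.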
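In step 101, the processor 211 of the backup server 21 determines from the fingerprint index table, according to the value of the fingerprint of the data block to be stored, the index fingerprints whose represented fingerprint range includes the fingerprint of the data block to be stored. The set of index fingerprints so determined is the first fingerprint set.
For ease of description, the embodiments use a first fingerprint table and a second fingerprint table as an example. This does not mean that the storage system contains only the first and second fingerprint tables, nor that step 101 is performed only on the index fingerprints of those two tables. In practice, step 101 is performed on the index fingerprints of every fingerprint table. Because the fingerprint ranges of the index fingerprints of one fingerprint table do not overlap, at most one index fingerprint can be determined for each fingerprint table.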
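Step 101 amounts to a range lookup per fingerprint table. The sketch below is one hedged way to express it, assuming fingerprints are comparable integers and index fingerprints are kept sorted. The function name `first_fingerprint_set`, the dictionary layout, and the per-table upper bound (standing in for the stored fingerprint-range attribute) are assumptions made for the example.

```python
from bisect import bisect_right


def first_fingerprint_set(index_table, fp):
    """Return [(table_id, index_fp)] for tables whose covering range contains fp.

    index_table maps table_id -> (sorted index fingerprints, largest fingerprint
    in the table); the upper bound is an assumed stand-in for the stored
    fingerprint-range attribute.
    """
    selected = []
    for table_id, (index_fps, max_fp) in index_table.items():
        pos = bisect_right(index_fps, fp) - 1   # last index fingerprint <= fp
        if pos >= 0 and fp <= max_fp:
            selected.append((table_id, index_fps[pos]))  # at most one per table
    return selected
```

With `{"T1": ([1, 4, 7], 9), "T2": ([3, 6], 8)}` and the fingerprint 5, the lookup selects index fingerprint 4 in T1 (range [4, 7)) and index fingerprint 3 in T2 (range [3, 6)), one candidate per table.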
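Step 102 is then performed: the backup server 21 determines, for each fingerprint table, the probability that the table contains a fingerprint identical to that of the data block to be stored. Step 101 has already determined, for each fingerprint table, the index fingerprint whose range includes the fingerprint of the data block to be stored, so the fingerprints represented by the table's other index fingerprints cannot contain that fingerprint. The probability that the first fingerprint table contains a fingerprint identical to that of the data block to be stored is therefore, in essence, the probability that the multiple fingerprints represented by the first index fingerprint determined in step 101 contain it. Likewise, the probability for the second fingerprint table is, in essence, the probability that the multiple fingerprints represented by the second index fingerprint determined in step 101 contain it.
In a specific implementation, the processor 211 sends the fingerprint of the data block to be stored and the index fingerprints in the first fingerprint set to the auxiliary storage 213. The auxiliary storage 213 uses its processing unit 214 to determine, for each index fingerprint in the first fingerprint set, the probability of finding the fingerprint of the data block to be stored among the multiple fingerprints that the index fingerprint represents, and returns the determined probability values to the processor 211. The auxiliary storage 213 may determine this probability from statistical information about the fingerprints represented by the received index fingerprint, such as the distribution of fingerprints near the value of the fingerprint of the data block to be stored, or frequency statistics of fingerprint values in each fingerprint dimension.
When the backup server 21 includes two or more auxiliary storages 213, the processor may send to each auxiliary storage 213 only those index fingerprints of the first fingerprint set that relate to the fingerprint tables held by that auxiliary storage. For example, suppose the backup server 21 includes a first auxiliary storage holding the first fingerprint table and a second auxiliary storage holding the second fingerprint table. The processor 211 sends the fingerprint of the data block to be stored and the first index fingerprint to the first auxiliary storage, and sends the fingerprint of the data block to be stored and the second index fingerprint to the second auxiliary storage. The first auxiliary storage determines the first probability through its processing unit and returns it to the processor 211; the second auxiliary storage determines the second probability through its processing unit and returns it to the processor 211.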
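Step 103 is then performed: from the received probabilities of finding the fingerprint of the data block to be stored in each fingerprint table, the backup server 21 determines the second fingerprint set, which contains the index fingerprints of the first fingerprint set that satisfy a preset condition. The preset condition is that the probability of finding the fingerprint of the data block to be stored among the multiple fingerprints represented by the index fingerprint (that is, the probability determined in step 102 that the fingerprint table storing those fingerprints contains a fingerprint identical to that of the data block to be stored) is greater than a preset threshold. The preset threshold may be 0, in which case the second fingerprint set is formed by removing from the first fingerprint set those index fingerprints for which the probability of finding the fingerprint of the data block to be stored among the represented fingerprints is 0.
For example, suppose the second probability that the auxiliary storage 213 returns to the processor 211, for finding the fingerprint of the data block to be stored among the fingerprints represented by the second index fingerprint, is less than the preset threshold, while the first probability, for finding it among the fingerprints represented by the first index fingerprint, is greater than the preset threshold. The first index fingerprint is then included in the second fingerprint set, and the second index fingerprint is not.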
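Step 104 is then performed: the backup server 21 performs fingerprint comparison among the multiple fingerprints represented by the index fingerprints in the second fingerprint set and obtains a comparison result, from which it determines whether the data block to be stored is duplicate data. There are two ways to do this. In the first, the processor 211 reads the fingerprints represented by the index fingerprints in the second fingerprint set from the fingerprint tables in the auxiliary storage 213 into the main memory 212 and compares fingerprints there. In the second, the auxiliary storage 213 performs the comparison with its own processing unit 214: the processing unit 214 reads those fingerprints from the fingerprint tables stored in the auxiliary storage 213 into a buffer of the auxiliary storage 213 and compares them there. The buffer of the auxiliary storage 213 may be a random access memory (RAM) or a cache. Comparing fingerprints in the auxiliary storage 213 requires no external data movement and therefore reduces the time spent moving data.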
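In practice, the preset threshold in step 103 may also be a probability value greater than 0. The backup server 21 first compares fingerprints among the fingerprints represented by the index fingerprints in the second fingerprint set determined with that threshold. If the result cannot confirm whether the data block to be stored is duplicate data, the backup server 21 then compares fingerprints among the fingerprints represented by the index fingerprints whose probability lies in the interval (0, preset threshold). In other words, the backup server first compares against the index fingerprints with higher probability and falls back to those with lower probability only when the first comparison cannot settle whether the data block is a duplicate, which effectively reduces the time spent on fingerprint comparison.
In another way of performing the comparison, the preset threshold in step 103 is 0, and the index fingerprints whose probability is greater than 0 are the elements of the second fingerprint set. The backup server 21 may then sort the index fingerprints in the second fingerprint set by probability, and when the processor 211 reads the represented fingerprints into the main memory for lookup, it reads them in that order. That is, it first reads the fingerprints represented by the index fingerprint with the highest probability; if these cannot establish that the data block to be stored is duplicate data, it reads the fingerprints represented by the index fingerprint with the second-highest probability, and so on, until either a fingerprint identical to that of the data block to be stored is found, or it is established that none of the fingerprints represented by the index fingerprints with probability greater than 0 is identical to it. The former case means the data block to be stored is duplicate data; the latter means it is new data.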
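The probability-ordered comparison described above can be sketched as follows, assuming a threshold of 0. The function name `match_in_probability_order` and the `load_fingerprints` callback (a stand-in for reading a group of fingerprints into main memory) are hypothetical and exist only for illustration.

```python
def match_in_probability_order(probabilities, fp, load_fingerprints):
    """Compare fp against candidate groups, most probable group first.

    probabilities maps index_fp -> probability; load_fingerprints(index_fp)
    returns the fingerprints that index_fp represents (a hypothetical callback
    standing in for reading them into main memory).
    """
    # Step 103 with threshold 0: drop groups that cannot contain fp.
    candidates = [(p, idx) for idx, p in probabilities.items() if p > 0]
    # Step 104: load and compare one group at a time, highest probability first.
    for _, idx in sorted(candidates, key=lambda c: c[0], reverse=True):
        if fp in load_fingerprints(idx):
            return idx        # identical fingerprint found: duplicate data
    return None               # no candidate group contains fp: new data
```

The point of the ordering is that a hit in an early group avoids loading the remaining groups at all, and groups with probability 0 are never loaded.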
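In the foregoing technical solution, the backup server 21 uses the fingerprint index table to determine, in each fingerprint table, the multiple fingerprints represented by one index fingerprint whose fingerprint range includes the fingerprint of the data block to be stored, and performs the subsequent fingerprint comparison only on the fingerprints so determined, which reduces the amount of fingerprint data to be moved. Moreover, for each fingerprint table the probability of finding the fingerprint of the data block to be stored among the determined fingerprints is calculated, and comparison is performed only among the determined fingerprints of the tables whose probability is not less than the preset threshold, which further reduces the amount of fingerprint data moved and the time spent on comparison, and improves data storage efficiency.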
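Implementation 2
FIG. 7 is a schematic structural diagram of the storage system 30 corresponding to Implementation 2. The storage system 30 includes a backup server 31 and multiple memories 32. In Implementation 2, steps 101 to 104 are performed by the backup server 31.
The memories 32 store data blocks and fingerprint tables. For the form of a fingerprint table, refer to FIG. 5. The fingerprint table stored in a memory 32 may be formed from the fingerprint information of the data blocks stored in that memory, or it may hold the fingerprint information of other data blocks unrelated to the data blocks stored in that memory. The backup server 31 includes a processor 311 and a main memory 312. The backup server 31 stores a fingerprint index table, which may be kept in the main memory 312 or in another storage unit of the backup server 31.
In step 101, the processor 311 determines the first fingerprint set in the same way as the processor 211 described above, so the details are not repeated.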
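The backup server 31 then performs step 102 to obtain, for each fingerprint table, the probability of finding the fingerprint of the data block to be stored in that table (that is, among the fingerprints represented by the table's index fingerprint in the first fingerprint set). There are two ways to do this. In the first, the memory 32 that stores the fingerprint table includes a processing unit 321. Much as the backup server 21 obtains the probability values, the backup server 31 sends the fingerprint of the data block to be stored and the index fingerprints in the first fingerprint set to the memory 32; the memory 32 determines the probability value using its processing unit 321 and the fingerprint statistics it holds, and sends the value to the backup server 31. In the second, the processor 311 reads the fingerprint statistics stored in the memory 32 into the main memory 312 and determines the probability value itself from the statistics in the main memory 312.
The backup server 31 then performs step 103 in the same way as the backup server 21 described above.
The backup server 31 then performs step 104 to obtain the result of comparing the fingerprint of the data block to be stored against the fingerprints represented by the index fingerprints in the second fingerprint set. As with the backup server 21, the backup server 31 may have the processor 311 read those fingerprints from the memories 32 into the main memory and compare them there, so that the backup server 31 does the comparison itself. Alternatively, the backup server 31 sends the index fingerprints in the second fingerprint set and the fingerprint of the data block to be stored to the memories 32, and each memory 32 compares fingerprints locally with its own processing unit 321. This reduces data movement, and because multiple memories 32 can compare fingerprints in parallel, it improves the efficiency of fingerprint comparison.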
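The second option, in which each memory compares locally and several memories work in parallel, might look roughly like the sketch below. It simulates the memories with in-process objects and threads. `MemoryNode`, `compare`, and `parallel_match` are hypothetical names; a real system would send the request over whatever interconnect links the backup server and the memories.

```python
from concurrent.futures import ThreadPoolExecutor


class MemoryNode:
    """Stand-in for a memory 32 with its own processing unit (hypothetical)."""

    def __init__(self, groups):
        self.groups = groups  # index_fp -> set of fingerprints it represents

    def compare(self, index_fp, fp):
        # Runs "locally": only the boolean result travels back to the server.
        return fp in self.groups.get(index_fp, set())


def parallel_match(requests, fp):
    """requests is a list of (node, index_fp); return True if any node matches."""
    if not requests:
        return False
    with ThreadPoolExecutor(max_workers=len(requests)) as pool:
        results = pool.map(lambda r: r[0].compare(r[1], fp), requests)
        return any(results)
```

Only the fingerprint to look up and the index fingerprint go to each node, and only a yes/no answer comes back, which is the reduction in data movement the paragraph refers to.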
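One way to perform fingerprint comparison according to steps 103 and 104 is as follows. The preset threshold in step 103 is 0, and the index fingerprints whose probability is greater than 0 are the elements of the second fingerprint set. The backup server 31 may then sort the index fingerprints in the second fingerprint set by probability, and when the processor 311 reads the represented fingerprints into the main memory for lookup, it reads them in that order. That is, it first reads the fingerprints represented by the index fingerprint with the highest probability; if these cannot establish that the data block to be stored is duplicate data, it reads the fingerprints represented by the index fingerprint with the second-highest probability, and so on, until either a fingerprint identical to that of the data block to be stored is found, or it is established that none of the fingerprints represented by the index fingerprints with probability greater than 0 is identical to it. The former case means the data block to be stored is duplicate data; the latter means it is new data.
In both implementations of steps 101 to 104, the fingerprint of the data block to be stored needs to be matched only against the fingerprints represented by the index fingerprints in the second fingerprint set, rather than against the fingerprints of all data in the fingerprint library, which reduces the amount of data moved during fingerprint comparison and improves data processing efficiency.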
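Optionally, when the backup server 21 in the storage system 20 includes two or more auxiliary storages 213, the fingerprint index table may also contain location information for the first fingerprint table to which the first index fingerprint belongs, that is, which auxiliary storage holds the first fingerprint table. When the backup server 21 performs step 102, the processor 211 uses the location information of the first index fingerprint to locate the auxiliary storage holding the first fingerprint table and sends the first index fingerprint and the fingerprint of the data block to be stored to that auxiliary storage, so that it determines the probability that the first fingerprint table contains a fingerprint identical to that of the data block to be stored.
Optionally, in the storage system 30, the fingerprint index table may also contain location information for the first fingerprint table to which the first index fingerprint belongs, that is, the identifier of the first memory that holds the first fingerprint table. When the backup server 31 performs step 102, the processor 311 uses the location information of the first index fingerprint to locate the first memory and sends the first index fingerprint and the fingerprint of the data block to be stored to the first memory, so that the first memory determines the probability that the first fingerprint table contains a fingerprint identical to that of the data block to be stored.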
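In one case, the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories.
In step 102, obtaining, according to the first index fingerprint, the first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored includes the following steps:
sending the fingerprint of the data block to be stored and the first index fingerprint to the first memory; and
receiving the first probability returned by the first memory, where the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.
In step 102, obtaining, according to the second index fingerprint, the second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored includes the following steps: sending the fingerprint of the data block to be stored and the second index fingerprint to the second memory; and receiving the second probability returned by the second memory, where the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.
Specifically, this corresponds to the case in Implementation 2 in which the backup server 31 performs step 102. The first memory and the second memory are two of the memories 32, each able to perform fingerprint comparison with its own processing unit 321. The details have been described in Implementation 2 and are not repeated here.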
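In another case, the backup server includes an auxiliary storage, and the first fingerprint table and the second fingerprint table are stored in the auxiliary storage.
In step 102, obtaining, according to the first index fingerprint, the first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and obtaining, according to the second index fingerprint, the second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, include the following steps:
sending the fingerprint of the data block to be stored, the first index fingerprint, and the second index fingerprint to the auxiliary storage; and receiving, from the auxiliary storage, the first probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and the second probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.
Specifically, this corresponds to the case in Implementation 1 in which the backup server 21 performs step 102. The auxiliary storage in this embodiment is the auxiliary storage 213 of Implementation 1, which can perform fingerprint comparison with its own processing unit 214. The details have been described in Implementation 1 and are not repeated here.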
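Optionally, in the embodiments of the present invention, each fingerprint in the first fingerprint table contains M bits, and each M-bit fingerprint contains N intervals. Each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit counts of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number. With this arrangement, each fingerprint in a fingerprint table is divided into N intervals, and each interval acts as one fingerprint dimension. For example, a 64-bit fingerprint can be divided into four 16-bit values: bits 1 to 16 form the first dimension, bits 17 to 32 the second, bits 33 to 48 the third, and bits 49 to 64 the fourth. In practice, neither the number of bits in each dimension nor equality of the bit counts across dimensions is required.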
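Splitting an M-bit fingerprint into N non-overlapping intervals is plain bit slicing. The sketch below assumes that bit 1 is the most significant bit, which matches the later example in which 01020504H has dimension values 1, 2, 5, and 4; the source does not state the bit order explicitly, and the function name `split_fingerprint` is chosen for illustration.

```python
def split_fingerprint(fp, total_bits, widths):
    """Split an integer fingerprint into consecutive intervals ("dimensions").

    widths lists the bit count of each interval, first interval at the
    most significant end (an assumed bit order); the widths must sum to
    total_bits, and they need not be equal.
    """
    if sum(widths) != total_bits:
        raise ValueError("interval widths must sum to the fingerprint length")
    values = []
    shift = total_bits
    for width in widths:
        shift -= width
        values.append((fp >> shift) & ((1 << width) - 1))
    return values
```

For instance, `split_fingerprint(0x01020504, 32, [8, 8, 8, 8])` yields `[1, 2, 5, 4]`, and unequal widths such as `[4, 12, 16]` are accepted as long as they cover all the bits.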
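The storage system stores a first statistical table, which contains statistical information about the values, in the N intervals, of the multiple fingerprints represented by the first index fingerprint. FIG. 8 is a schematic diagram of one such first statistical table. Suppose the first index fingerprint represents three fingerprints, 01020504H, 01030504H, and 02030102H, each divided into four dimensions. Taking 01020504H as an example, its values in the four fingerprint dimensions are 1, 2, 5, and 4. The first statistical table records, for each possible value in each dimension, how often that value occurs in that dimension among the fingerprints represented by the first index fingerprint. For example, in the first dimension the value 1 occurs twice, the value 2 occurs once, and the values 3, 4, and 5 each occur zero times.
The first probability is determined as follows: determine, according to the first statistical table, the frequency t_i with which a_i occurs among the values of the i-th interval of the multiple fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N; and determine the first probability according to the minimum of the obtained t_1 to t_N.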
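Specifically, in the storage system 20, the first statistical table may be stored in the auxiliary storage 213 that holds the first fingerprint table, and the first probability is determined by that auxiliary storage 213 through its processing unit 214. For example, suppose the fingerprint of the data block to be stored is 01020404H, whose values in the four fingerprint dimensions are 1, 2, 4, and 4. The probability indicator for finding this fingerprint in the table shown in FIG. 8 is calculated as follows: the value 1 has frequency 2 in the first dimension, the value 2 has frequency 1 in the second dimension, the value 4 has frequency 0 in the third dimension, and the value 4 has frequency 2 in the fourth dimension, so the probability indicator is the minimum of these frequencies, 0.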
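The worked example above can be reproduced with a small sketch. It assumes 32-bit fingerprints split into four 8-bit dimensions, most significant byte first, and uses one `collections.Counter` per dimension as the statistical table. The function names are illustrative, and the returned value is the minimum frequency (the "probability indicator"), not a normalized probability.

```python
from collections import Counter


def dims(fp, n=4, width=8):
    """Dimension values of fp, most significant interval first (assumed order)."""
    return [(fp >> (width * (n - 1 - i))) & ((1 << width) - 1) for i in range(n)]


def build_stat_table(fingerprints, n=4, width=8):
    """One Counter per dimension: value -> how often it occurs in that dimension."""
    table = [Counter() for _ in range(n)]
    for fp in fingerprints:
        for i, value in enumerate(dims(fp, n, width)):
            table[i][value] += 1
    return table


def probability_indicator(table, fp, n=4, width=8):
    """Return (min frequency, per-dimension frequencies t_1..t_N) for fp."""
    freqs = [table[i][value] for i, value in enumerate(dims(fp, n, width))]
    return min(freqs), freqs


table = build_stat_table([0x01020504, 0x01030504, 0x02030102])
print(probability_indicator(table, 0x01020404))  # (0, [2, 1, 0, 2])
```

A minimum of 0 proves that the fingerprint is absent from the group, because a stored fingerprint would contribute at least 1 to every dimension. A nonzero minimum only means the fingerprint may be present.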
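In the storage system 30, the first statistical table may be stored in the memory 32 that holds the first fingerprint table, and the first probability is determined by that memory through its processing unit 321, in the same way that the processing unit 214 determines the first probability from the first statistical table.
In practice, the second probability can be determined in the same way from a statistical table of fingerprints, so the details are not repeated here.
In the foregoing technical solution, the first statistical table is used to determine the first probability of finding the fingerprint of the data block to be stored among the fingerprints represented by the first index fingerprint. This approach is simple to implement, requires little computation, takes little time, and gives accurate results.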
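Optionally, in another embodiment, the storage system stores a first statistical table; FIG. 9 is a schematic diagram of another such first statistical table. The first statistical table contains statistical information about the values of a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information about the values of a second interval of those fingerprints. The first interval is the interval from the h-th bit to the i-th bit of each fingerprint, and the second interval is the interval from the j-th bit to the k-th bit of each fingerprint, where h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap.
The first probability is determined as follows: determine, according to the first statistical table, the frequency t_1 with which the value a occurs among the values of the first interval of the multiple fingerprints represented by the first index fingerprint, and the frequency t_2 with which the value b occurs among the values of the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of its j-th to k-th bits; and determine the first probability according to the minimum of t_1 and t_2.
In practice, the main purpose of determining the first probability is to exclude from the first fingerprint set those index fingerprints for which the probability that the represented fingerprints contain a fingerprint identical to that of the data block to be stored is 0. In this embodiment, referring to FIG. 9, the first statistical table contains statistics for only some of the fingerprint dimensions. Continuing the example in which the first index fingerprint represents the three fingerprints 01020504H, 01030504H, and 02030102H, the first statistical table may hold statistics for only the first and third dimensions. Suppose the fingerprint of the data block to be stored is 01020404H, whose third dimension is 4. The first statistical table of FIG. 9 shows that 4 occurs in the third dimension with frequency 0, so the fingerprints represented by the first index fingerprint do not contain the fingerprint of the data block to be stored. Likewise, the first statistical table of this embodiment may be kept in the auxiliary storage 213 of the storage system 20, in which case the first probability is determined by the auxiliary storage 213 through its own processing unit 214, or in a memory 32 of the storage system 30, in which case the first probability is determined by the processing unit 321 of that memory 32.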
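A statistical table that tracks only some intervals trades accuracy for space. The sketch below assumes 32-bit fingerprints with bits numbered from 1 at the most significant end, and tracks bits 1 to 8 and bits 17 to 24 (the first and third dimensions of the running example). The helper names are hypothetical.

```python
from collections import Counter


def interval_value(fp, first_bit, last_bit, total_bits=32):
    """Value of bits first_bit..last_bit, counting from 1 at the high end."""
    width = last_bit - first_bit + 1
    return (fp >> (total_bits - last_bit)) & ((1 << width) - 1)


def build_partial_table(fingerprints, intervals, total_bits=32):
    """One Counter per tracked interval, e.g. only dimensions 1 and 3."""
    return {
        iv: Counter(interval_value(fp, iv[0], iv[1], total_bits) for fp in fingerprints)
        for iv in intervals
    }


def min_frequency(table, fp, total_bits=32):
    """Minimum of t_1 and t_2 (or of however many intervals are tracked)."""
    return min(
        counts[interval_value(fp, iv[0], iv[1], total_bits)]
        for iv, counts in table.items()
    )
```

With the three example fingerprints and the intervals `(1, 8)` and `(17, 24)`, the fingerprint 01020404H gets a minimum frequency of 0 and its group is excluded. A fingerprint such as 01030102H gets a minimum of 1 even though it is not in the group, so a nonzero result still has to be confirmed by the actual comparison in step 104.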
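In the foregoing technical solution, statistics for only some fingerprint dimensions are used to determine the first probability of finding the fingerprint of the data block to be stored among the fingerprints represented by the first index fingerprint, so that index fingerprints whose probability is 0 are removed from the first fingerprint set. This reduces the amount of data moved during fingerprint comparison, and the approach is simple to implement, requires little computation, and takes little time.
Optionally, in another embodiment, when the first fingerprint set is empty, the backup server 21 or the backup server 31 determines that the data block to be stored is new data.
Optionally, in another embodiment, when the preset threshold is 0 and the second fingerprint set is empty, the backup server 21 or the backup server 31 determines that the data block to be stored is new data.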
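Optionally, in another embodiment, before performing step 101 the backup server further performs the following step: apply fingerprint filtering to the fingerprint of the data block to be stored, and determine that the filtering cannot tell whether that fingerprint is a duplicate fingerprint.
Specifically, before looking up the fingerprint of the data block to be stored in the fingerprint tables, the backup server may pre-judge the fingerprint with a fingerprint filtering technique. There are three possible outcomes. First, the fingerprint is determined to exist in the fingerprint tables, so the data block is duplicate data. Second, the fingerprint is determined not to be in the fingerprint tables, so the data block is new data. Third, it cannot be asserted whether the fingerprint tables contain the fingerprint; only in this case does the backup server perform steps 101 to 104.
In a specific implementation, a fingerprint filtering technique such as a Bloom filter or locality preserved caching (LPC) may be used, or two or more such techniques may be combined. For example, a Bloom filter is applied first, and if it cannot tell whether the fingerprint of the data block to be stored is a duplicate, LPC is applied next. For how these fingerprint filtering techniques are implemented, refer to the prior art; they are not detailed here.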
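The three pre-judgment outcomes can be illustrated with a toy Bloom filter combined with a cache of recently seen fingerprints. This is a hedged sketch only: the patent defers to the prior art for both techniques, so the class `BloomFilter`, the function `prefilter`, the SHA-256-based hashing, and the plain set used in place of a real LPC cache are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, fp: bytes):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + fp).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, fp: bytes):
        for p in self._positions(fp):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, fp: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(fp))


def prefilter(fp, bloom, recent_cache):
    """Return 'duplicate', 'new', or 'unknown' (only 'unknown' needs steps 101-104)."""
    if fp in recent_cache:             # cache hit: certainly stored already
        return "duplicate"
    if not bloom.might_contain(fp):    # Bloom miss: certainly never stored
        return "new"
    return "unknown"                   # possible false positive: full lookup needed
```

Only the `"unknown"` outcome falls through to the index-fingerprint lookup, which is why filtering shortens the comparison time for many fingerprints.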
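In the foregoing technical solution, the backup server first pre-judges the fingerprint of the data block to be stored with a fingerprint filtering technique and performs steps 101 to 104 only when the fingerprint cannot be pre-judged. Fingerprint filtering can greatly shorten the comparison time for some fingerprints and improve the performance of the backup server.
Optionally, in the storage system 20, the fingerprint tables may be maintained as follows. A new fingerprint table is created in the main memory 212 in real time. When it is determined that the currently stored fingerprint tables do not contain the fingerprint of the data block to be stored, the data block is determined to be new data and its fingerprint is added to the fingerprint table being created in the main memory. When the number of fingerprints in the in-memory fingerprint table reaches a set value, the table is saved to the auxiliary storage 213. In addition, each index fingerprint of that table is added to the fingerprint index table, and the statistical table of the dimension values of the fingerprints represented by each index fingerprint of that table is saved to the auxiliary storage 213.
When fingerprint tables are maintained in this way, the backup server, when comparing fingerprints, first compares against the fingerprint table being created in the main memory, and performs steps 101 to 104, comparing against the fingerprint tables stored outside the main memory, only when that comparison cannot determine whether the fingerprint of the data block to be stored is a duplicate.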
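The maintenance scheme for storage system 20 can be sketched as a small buffer that is sorted and flushed once it reaches a set size. The class name `FingerprintTableWriter`, its fields, and the fixed group size per index fingerprint are assumptions made for illustration. Writing the per-group statistical tables is omitted to keep the sketch short.

```python
class FingerprintTableWriter:
    """Builds a fingerprint table in main memory and flushes it when full."""

    def __init__(self, capacity, group_size):
        self.capacity = capacity        # fingerprints per table before flushing
        self.group_size = group_size    # fingerprints represented by one index fingerprint
        self.in_memory = set()          # table currently being created
        self.stored_tables = []         # stands in for auxiliary storage
        self.index_table = []           # (table number, index fingerprint)

    def seen_in_memory(self, fp):
        """First check during comparison: the table still being created."""
        return fp in self.in_memory

    def add_new_fingerprint(self, fp):
        self.in_memory.add(fp)
        if len(self.in_memory) >= self.capacity:
            self.flush()

    def flush(self):
        table = sorted(self.in_memory)              # arrange by fingerprint value
        table_no = len(self.stored_tables)
        self.stored_tables.append(table)
        for start in range(0, len(table), self.group_size):
            # The first fingerprint of each run becomes its index fingerprint.
            self.index_table.append((table_no, table[start]))
        self.in_memory = set()
```

Adding the fingerprints 9, 2, 7, 4, 1, 5 with a capacity of 6 and a group size of 3 produces one stored table `[1, 2, 4, 5, 7, 9]` with index fingerprints 1 and 5.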
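Optionally, in the storage system 30, the fingerprint tables may be maintained as follows: when the data block to be stored is determined to be new data, its fingerprint is added to the fingerprint table of the memory 32 that stores the data block, and the statistical table of that fingerprint table is then updated.
It should be noted that each of the processor 211, the processor 311, the processing unit 214, and the processing unit 321 may be a single processor or a collective term for multiple processing elements. For example, each may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, such as one or more digital signal processors (DSPs) or one or more field programmable gate arrays (FPGAs).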
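Based on the same inventive concept, an embodiment of the present invention provides a backup server 40 applied in a storage system. The storage system includes the backup server and multiple memories, the storage system stores multiple fingerprint tables, and the multiple fingerprint tables record the fingerprints of the data blocks already stored in the multiple memories. FIG. 10 is a schematic structural block diagram of the backup server 40, which includes:
a determining module 41, configured to determine a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of a data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the data block to be stored falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
an obtaining module 42, configured to obtain, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and to obtain, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
the determining module 41 being further configured to determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold; and
a processing module 43, configured to obtain a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored.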
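Optionally, in the embodiments of the present invention, the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories;
the obtaining module 42 is specifically configured to: send the fingerprint of the data block to be stored and the first index fingerprint to the first memory, and receive the first probability returned by the first memory, where the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored; and
send the fingerprint of the data block to be stored and the second index fingerprint to the second memory, and receive the second probability returned by the second memory, where the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.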
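Optionally, in the embodiments of the present invention, the backup server 40 further includes:
an auxiliary storage, configured to store the first fingerprint table and the second fingerprint table;
the obtaining module 42 is specifically configured to: send the fingerprint of the data block to be stored, the first index fingerprint, and the second index fingerprint to the auxiliary storage; and receive, from the auxiliary storage, the first probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and the second probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored.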
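Optionally, in the embodiments of the present invention, the auxiliary storage is further configured to store a first statistical table; the first statistical table contains statistical information about the values of a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information about the values of a second interval of those fingerprints; the first interval is the interval from the h-th bit to the i-th bit of each fingerprint, and the second interval is the interval from the j-th bit to the k-th bit of each fingerprint, where h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
the auxiliary storage is further configured to: determine, according to the first statistical table, the frequency t_1 with which the value a occurs among the values of the first interval of the multiple fingerprints represented by the first index fingerprint, and the frequency t_2 with which the value b occurs among the values of the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of its j-th to k-th bits; and determine the first probability according to the minimum of t_1 and t_2.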
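Optionally, in the embodiments of the present invention, each fingerprint in the first fingerprint table contains M bits, and each M-bit fingerprint contains N intervals; each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit counts of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number; the auxiliary storage is further configured to store a first statistical table, and the first statistical table contains statistical information about the values of the N intervals of the multiple fingerprints represented by the first index fingerprint;
the auxiliary storage is further configured to: determine, according to the first statistical table, the frequency t_i with which a_i occurs among the values of the i-th interval of the multiple fingerprints represented by the first index fingerprint, where a_i is the value of the i-th interval of the fingerprint of the data block to be stored and i ranges from 1 to N; and determine the first probability according to the minimum of t_1 to t_N.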
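The backup server 40 in this embodiment and the data processing method corresponding to FIG. 3 are two aspects of the same inventive concept. The implementation of the method has been described in detail above, so a person skilled in the art can clearly understand the structure and implementation of the backup server 40 from that description. For brevity, details are not repeated here.
Based on the same inventive concept, an embodiment of the present invention provides a storage system including a backup server and multiple memories, where the storage system stores multiple fingerprint tables, and the multiple fingerprint tables record the fingerprints of the data blocks already stored in the multiple memories.
The backup server is configured to: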
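determine a first fingerprint set according to the index fingerprints in a fingerprint index table and the fingerprint of a data block to be stored, where the first fingerprint set includes a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the data block to be stored falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
obtain, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, and obtain, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the data block to be stored, where the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
determine a second fingerprint set according to the first probability and the second probability, where the second fingerprint set includes at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold; and
obtain a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the data block to be stored.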
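Optionally, in the embodiments of the present invention, the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories; the backup server is specifically configured to:
send the fingerprint of the data block to be stored and the first index fingerprint to the first memory, and receive the first probability returned by the first memory, where the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored; and
send the fingerprint of the data block to be stored and the second index fingerprint to the second memory, and receive the second probability returned by the second memory, where the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored;
the first memory is specifically configured to: receive the first index fingerprint and the fingerprint of the data block to be stored sent by the backup server, determine the first probability that the multiple fingerprints represented by the first index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and send the first probability to the backup server;
the second memory is specifically configured to: receive the second index fingerprint and the fingerprint of the data block to be stored sent by the backup server, determine the second probability that the multiple fingerprints represented by the second index fingerprint contain a fingerprint identical to the fingerprint of the data block to be stored, and send the second probability to the backup server.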
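Optionally, in the embodiments of the present invention, a first statistical table is stored in the first memory; the first statistical table contains statistical information about the values of a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information about the values of a second interval of those fingerprints; the first interval is the interval from the h-th bit to the i-th bit of each fingerprint, and the second interval is the interval from the j-th bit to the k-th bit of each fingerprint, where h, i, j, and k are all natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
the first memory is specifically configured to:
determine, according to the first statistical table, the frequency t_1 with which the value a occurs among the values of the first interval of the multiple fingerprints represented by the first index fingerprint, and the frequency t_2 with which the value b occurs among the values of the second interval of those fingerprints, where a is the value of the h-th to i-th bits of the fingerprint of the data block to be stored and b is the value of its j-th to k-th bits; and
determine the first probability according to the minimum of t_1 and t_2.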
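Optionally, in this embodiment of the present invention, each fingerprint in the first fingerprint table contains M bits, and each M-bit fingerprint contains N intervals. Each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, and the bit widths of the N intervals sum to M, where N is a natural number greater than or equal to 2 and S is a natural number. The first memory stores a first statistics table, which contains statistical information on the values in the N intervals of the multiple fingerprints represented by the first index fingerprint.
The first memory is specifically configured to: determine, according to the first statistics table, the occurrence frequency ti of ai among the values in the i-th interval of the multiple fingerprints represented by the first index fingerprint, where ai is the value of the i-th interval of the fingerprint of the to-be-stored data block and i ranges from 1 to N; and
determine the first probability according to the minimum of t1 to tN.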
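The storage system in this embodiment and the data processing method corresponding to FIG. 3 are two aspects of the same inventive concept. Because the implementation of the method has been described in detail above, a person skilled in the art can clearly understand the structure and implementation of the storage system from that description; for brevity, details are not repeated here.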
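The one or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
In the foregoing technical solutions, the backup server uses the fingerprint index table to determine, in each fingerprint table, the multiple fingerprints represented by one index fingerprint, where the fingerprint range of those fingerprints covers the fingerprint of the to-be-stored data block. The subsequent fingerprint comparison is then performed only on the fingerprints determined from each fingerprint table, which reduces the amount of fingerprint data that has to be moved. In addition, for each fingerprint table, the probability of finding the fingerprint of the to-be-stored data block among the determined fingerprints is calculated, and fingerprint comparison is performed only on the determined fingerprints of those fingerprint tables whose probability is not less than the preset threshold. This further reduces the number of fingerprints moved during comparison and the time the comparison takes, that is, the time needed to decide whether the to-be-stored data block is a duplicate, and thereby improves the efficiency of storing data.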
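A person skilled in the art should understand that the terms "first", "second", "third", "fourth", and so on (if any) in the specification, claims, and accompanying drawings of the present invention are used to distinguish between similar objects and do not necessarily describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate, so that the embodiments of the present invention described here can, for example, be implemented in an order other than those illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.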
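An embodiment of the present invention further provides a computer program product for data processing, including a computer-readable storage medium that stores program code, where the program code includes instructions for performing the method procedure described in any one of the foregoing method embodiments. A person of ordinary skill in the art will understand that the foregoing storage medium includes any non-transitory machine-readable medium capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a random-access memory (RAM), a solid state disk (SSD), or a non-volatile memory.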
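It should be noted that the embodiments provided in this application are merely illustrative. A person skilled in the art will clearly understand that, for convenience and brevity, each of the foregoing embodiments is described with its own emphasis; for a part not detailed in one embodiment, refer to the related description in other embodiments. The features disclosed in the embodiments, claims, and accompanying drawings of the present invention may exist independently or in combination. Features described in hardware form in the embodiments of the present invention may be implemented by software, and vice versa. No limitation is imposed here.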

Claims (14)

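  1. A data processing method, wherein the method is performed by a backup server in a storage system, the storage system includes the backup server and multiple memories, the storage system stores multiple fingerprint tables, the multiple fingerprint tables record fingerprints of data blocks already stored in the multiple memories, and the method comprises:
    determining a first fingerprint set according to index fingerprints in a fingerprint index table and a fingerprint of a to-be-stored data block, wherein the first fingerprint set contains a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the to-be-stored data block falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
    obtaining, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block, and obtaining, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block, wherein the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
    determining a second fingerprint set according to the first probability and the second probability, wherein the second fingerprint set contains at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold; and
    obtaining a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the to-be-stored data block.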
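  2. The method according to claim 1, wherein the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories; and the obtaining, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block comprises:
    sending the fingerprint of the to-be-stored data block and the first index fingerprint to the first memory; and
    receiving the first probability returned by the first memory, wherein the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block; and
    the obtaining, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block comprises:
    sending the fingerprint of the to-be-stored data block and the second index fingerprint to the second memory; and
    receiving the second probability returned by the second memory, wherein the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block.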
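  3. The method according to claim 1, wherein the backup server includes an auxiliary memory, and the first fingerprint table and the second fingerprint table are stored in the auxiliary memory; and
    the obtaining, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block, and obtaining, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block comprises:
    sending the fingerprint of the to-be-stored data block, the first index fingerprint, and the second index fingerprint to the auxiliary memory; and
    receiving, from the auxiliary memory, the first probability that the multiple fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block, and the second probability that the multiple fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block.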
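  4. The method according to any one of claims 1 to 3, wherein each fingerprint in the first fingerprint table contains M bits, each M-bit fingerprint contains N intervals, each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit widths of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number;
    the storage system stores a first statistics table, the first statistics table contains statistical information on the values in the N intervals of the multiple fingerprints represented by the first index fingerprint, and the first probability is determined by:
    determining, according to the first statistics table, the occurrence frequency ti of ai among the values in the i-th interval of the multiple fingerprints represented by the first index fingerprint, wherein ai is the value of the i-th interval of the fingerprint of the to-be-stored data block, and i ranges from 1 to N; and
    determining the first probability according to the minimum of the obtained t1 to tN.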
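  5. The method according to any one of claims 1 to 3, wherein the storage system stores a first statistics table, the first statistics table contains statistical information on the values in a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information on the values in a second interval of the multiple fingerprints represented by the first index fingerprint, the first interval spans bit h to bit i of each fingerprint, the second interval spans bit j to bit k of each fingerprint, h, i, j, and k are natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap; and the first probability is determined by:
    determining, according to the first statistics table, the occurrence frequency t1 of a among the first-interval values of the multiple fingerprints represented by the first index fingerprint, and the occurrence frequency t2 of b among the second-interval values of the multiple fingerprints represented by the first index fingerprint, wherein a is the value of bit h to bit i of the fingerprint of the to-be-stored data block, and b is the value of bit j to bit k of the fingerprint of the to-be-stored data block; and
    determining the first probability according to the smaller of t1 and t2.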
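  6. A backup server, wherein the backup server is applied in a storage system, the storage system includes the backup server and multiple memories, the storage system stores multiple fingerprint tables, the multiple fingerprint tables record fingerprints of data blocks already stored in the multiple memories, and the backup server comprises:
    a determining module, configured to determine a first fingerprint set according to index fingerprints in a fingerprint index table and a fingerprint of a to-be-stored data block, wherein the first fingerprint set contains a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the to-be-stored data block falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
    an obtaining module, configured to obtain, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block, and obtain, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block, wherein the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
    the determining module being further configured to determine a second fingerprint set according to the first probability and the second probability, wherein the second fingerprint set contains at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold; and
    a processing module, configured to obtain a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the to-be-stored data block.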
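  7. The backup server according to claim 6, wherein the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories;
    the obtaining module is specifically configured to: send the fingerprint of the to-be-stored data block and the first index fingerprint to the first memory, and receive the first probability returned by the first memory, wherein the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block; and
    send the fingerprint of the to-be-stored data block and the second index fingerprint to the second memory, and receive the second probability returned by the second memory, wherein the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block.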
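  8. The backup server according to claim 6, wherein the backup server further comprises:
    an auxiliary memory, configured to store the first fingerprint table and the second fingerprint table;
    the obtaining module is specifically configured to: send the fingerprint of the to-be-stored data block, the first index fingerprint, and the second index fingerprint to the auxiliary memory; and receive, from the auxiliary memory, the first probability that the multiple fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block, and the second probability that the multiple fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block.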
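  9. The backup server according to claim 8, wherein:
    each fingerprint in the first fingerprint table contains M bits, each M-bit fingerprint contains N intervals, each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit widths of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number; the auxiliary memory is further configured to store a first statistics table, and the first statistics table contains statistical information on the values in the N intervals of the multiple fingerprints represented by the first index fingerprint; and
    the auxiliary memory is further configured to: determine, according to the first statistics table, the occurrence frequency ti of ai among the values in the i-th interval of the multiple fingerprints represented by the first index fingerprint, wherein ai is the value of the i-th interval of the fingerprint of the to-be-stored data block, and i ranges from 1 to N; and determine the first probability according to the minimum of t1 to tN.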
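  10. The backup server according to claim 8, wherein the auxiliary memory is further configured to store a first statistics table, the first statistics table contains statistical information on the values in a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information on the values in a second interval of the multiple fingerprints represented by the first index fingerprint, the first interval spans bit h to bit i of each fingerprint, the second interval spans bit j to bit k of each fingerprint, h, i, j, and k are natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap; and
    the auxiliary memory is further configured to: determine, according to the first statistics table, the occurrence frequency t1 of a among the first-interval values of the multiple fingerprints represented by the first index fingerprint, and the occurrence frequency t2 of b among the second-interval values of the multiple fingerprints represented by the first index fingerprint, wherein a is the value of bit h to bit i of the fingerprint of the to-be-stored data block, and b is the value of bit j to bit k of the fingerprint of the to-be-stored data block; and determine the first probability according to the smaller of t1 and t2.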
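  11. A storage system, comprising a backup server and multiple memories, wherein the storage system stores multiple fingerprint tables, and the multiple fingerprint tables record fingerprints of data blocks already stored in the multiple memories;
    the backup server is configured to:
    determine a first fingerprint set according to index fingerprints in a fingerprint index table and a fingerprint of a to-be-stored data block, wherein the first fingerprint set contains a first index fingerprint and a second index fingerprint, the first index fingerprint represents multiple fingerprints in a first fingerprint table, the second index fingerprint represents multiple fingerprints in a second fingerprint table, and the fingerprint of the to-be-stored data block falls within the fingerprint range of the multiple fingerprints represented by the first index fingerprint and within the fingerprint range of the multiple fingerprints represented by the second index fingerprint;
    obtain, according to the first index fingerprint, a first probability that the first fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block, and obtain, according to the second index fingerprint, a second probability that the second fingerprint table contains a fingerprint identical to the fingerprint of the to-be-stored data block, wherein the first probability is determined according to the multiple fingerprints represented by the first index fingerprint, and the second probability is determined according to the multiple fingerprints represented by the second index fingerprint;
    determine a second fingerprint set according to the first probability and the second probability, wherein the second fingerprint set contains at least the first index fingerprint, and the first probability determined according to the first index fingerprint is not less than a preset threshold; and
    obtain a matching result between the multiple fingerprints represented by the first index fingerprint and the fingerprint of the to-be-stored data block.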
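  12. The storage system according to claim 11, wherein the first fingerprint table is stored in a first memory of the multiple memories, and the second fingerprint table is stored in a second memory of the multiple memories; the backup server is specifically configured to:
    send the fingerprint of the to-be-stored data block and the first index fingerprint to the first memory, and receive the first probability returned by the first memory, wherein the first probability indicates the probability that the multiple fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block; and
    send the fingerprint of the to-be-stored data block and the second index fingerprint to the second memory, and receive the second probability returned by the second memory, wherein the second probability indicates the probability that the multiple fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block;
    the first memory is specifically configured to: receive the first index fingerprint and the fingerprint of the to-be-stored data block sent by the backup server, determine the first probability that the multiple fingerprints represented by the first index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block, and send the first probability to the backup server; and
    the second memory is specifically configured to: receive the second index fingerprint and the fingerprint of the to-be-stored data block sent by the backup server, determine the second probability that the multiple fingerprints represented by the second index fingerprint include a fingerprint identical to the fingerprint of the to-be-stored data block, and send the second probability to the backup server.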
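  13. The storage system according to claim 11 or 12, wherein each fingerprint in the first fingerprint table contains M bits, each M-bit fingerprint contains N intervals, each of the N intervals consists of S consecutive bits of the M bits, no two of the N intervals overlap, the bit widths of the N intervals sum to M, N is a natural number greater than or equal to 2, and S is a natural number; the first memory stores a first statistics table, and the first statistics table contains statistical information on the values in the N intervals of the multiple fingerprints represented by the first index fingerprint;
    the first memory is specifically configured to: determine, according to the first statistics table, the occurrence frequency ti of ai among the values in the i-th interval of the multiple fingerprints represented by the first index fingerprint, wherein ai is the value of the i-th interval of the fingerprint of the to-be-stored data block, and i ranges from 1 to N; and
    determine the first probability according to the minimum of t1 to tN.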
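  14. The storage system according to claim 11 or 12, wherein the first memory stores a first statistics table, the first statistics table contains statistical information on the values in a first interval of the multiple fingerprints represented by the first index fingerprint and statistical information on the values in a second interval of the multiple fingerprints represented by the first index fingerprint, the first interval spans bit h to bit i of each fingerprint, the second interval spans bit j to bit k of each fingerprint, h, i, j, and k are natural numbers, h is not greater than i, j is not greater than k, and the first interval and the second interval do not overlap;
    the first memory is specifically configured to:
    determine, according to the first statistics table, the occurrence frequency t1 of a among the first-interval values of the multiple fingerprints represented by the first index fingerprint, and the occurrence frequency t2 of b among the second-interval values of the multiple fingerprints represented by the first index fingerprint, wherein a is the value of bit h to bit i of the fingerprint of the to-be-stored data block, and b is the value of bit j to bit k of the fingerprint of the to-be-stored data block; and
    determine the first probability according to the smaller of t1 and t2.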
PCT/CN2016/091054 2015-07-31 2016-07-22 Data processing method, backup server, and storage system WO2017020735A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510468057.9 2015-07-31
CN201510468057.9A CN106407226B (zh) 2015-07-31 2015-07-31 Data processing method, backup server, and storage system

Publications (1)

Publication Number Publication Date
WO2017020735A1 true WO2017020735A1 (zh) 2017-02-09

Family

ID=57942441

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/091054 WO2017020735A1 (zh) 2015-07-31 2016-07-22 Data processing method, backup server, and storage system

Country Status (2)

Country Link
CN (1) CN106407226B (zh)
WO (1) WO2017020735A1 (zh)

Also Published As

Publication number Publication date
CN106407226A (zh) 2017-02-15
CN106407226B (zh) 2019-09-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16832223

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16832223

Country of ref document: EP

Kind code of ref document: A1