WO2015061995A1 - Data processing method, device, and duplication processor - Google Patents

Data processing method, device, and duplication processor

Info

Publication number
WO2015061995A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
index
index value
query unit
unit
Prior art date
Application number
PCT/CN2013/086253
Other languages
French (fr)
Chinese (zh)
Inventor
于传帅
张程伟
张宗全
林春恭
游俊
刘强
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to CN201380002568.0A (patent CN103930890B)
Priority to PCT/CN2013/086253 (WO2015061995A1)
Publication of WO2015061995A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/328 Management therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1756 De-duplication implemented within the file system, e.g. based on file segments based on delta files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • Embodiments of the present invention relate to storage technologies, and in particular, to a data processing method, apparatus, and deduplication processor.
  • Deduplication, also known as smart compression or single-instance storage, is a data storage technology that reduces storage capacity requirements. It automatically searches for duplicate data, keeps only one copy of identical data, and replaces the other duplicates with pointers to that single copy to eliminate redundancy.
  • A deduplication method can employ a fixed-length blocking algorithm.
  • When a sliding window is used to divide a data object into blocks, a fingerprint algorithm calculates the fingerprint of the part of the data object inside the sliding window. If a predetermined condition is met, the start and end positions of the sliding window are used as the boundaries of a data block; the data object is segmented by repeatedly sliding the window and calculating the fingerprint. For each data block obtained by the division, it is first determined whether the data block is longer than the lower length limit; if so, a fingerprint value of the data block, such as a hash value, is calculated and compared with the fingerprints stored in the storage device.
  • The data object may refer to a data block already stored in the storage device. If the fingerprint value of the data block does not exist in the storage device, the data block and its fingerprint value may be stored in the storage device for subsequent duplicate-data judgment.
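
The background scheme above (fingerprint each block, keep one copy, point duplicates at it) can be sketched as follows. This is only an illustrative sketch of the fixed-length variant: SHA-1 as the fingerprint algorithm, the in-memory `store` dictionary, and the function name `dedup_fixed` are assumptions, not details taken from the patent.

```python
import hashlib

def dedup_fixed(data: bytes, block_size: int, store: dict) -> list:
    """Split data into fixed-length blocks and store each distinct block once.

    Returns the list of fingerprints, which act as pointers to the stored copies.
    """
    pointers = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # Assumption: SHA-1 stands in for the unspecified fingerprint algorithm.
        fp = hashlib.sha1(block).hexdigest()
        if fp not in store:
            # Fingerprint not seen before: keep the block for later comparisons.
            store[fp] = block
        pointers.append(fp)
    return pointers
```

Note that this baseline keeps one fingerprint per block, which is the memory cost the embodiments below aim to reduce.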
  • Embodiments of the present invention provide a data processing method, apparatus, and a deduplication processor, which reduce memory usage and meet the increasing demand for data.
  • An embodiment of the present invention provides a data processing method applied to a data processing system that includes a deduplication processor. The method includes: the deduplication processor uses the data covered by a sliding window, within the data that requires a duplicate data search, as a first query unit, where the first query unit includes a plurality of minimum data blocks and a minimum data block is the data block of the minimum query unit for a duplicate data search; and index construction and a duplicate data search are performed on the data in the first query unit.
  • The index construction includes: extracting some bits from the fingerprint value of each minimum data block in the first query unit, and combining the extracted bits into an index value of a preset length corresponding to the first query unit;
  • The duplicate data search includes: querying a preset index table for an index value identical to the index value corresponding to the first query unit; and, if a first index value identical to the index value corresponding to the first query unit is found in the index table, searching the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
  • An embodiment of the present invention provides a first possible implementation, in which the method further includes: if the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value, using the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window and the second query unit includes at least one minimum data block; constructing one index value of the preset length from the at least one minimum data block in the second query unit; querying the index table for a second index value identical to the index value corresponding to the second query unit; and, if such a second index value is found in the index table, searching the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
  • An embodiment of the present invention further provides a second possible implementation, in which the method further includes: inserting into the index table the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit.
  • An embodiment of the present invention provides a third possible implementation: if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the method further includes: determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window; and, if not, sliding the sliding window by a preset step size, using the data in the sliding window after sliding as the first query unit, and performing the index construction step and the duplicate data search step.
  • An embodiment of the present invention further provides a fourth possible implementation: if no first index value identical to the index value corresponding to the first query unit is found in the index table, the method further includes:
  • An embodiment of the present invention further provides a fifth possible implementation: if no first index value identical to the index value corresponding to the first query unit is found in the index table, the method further includes: querying the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than a preset matching degree; if the index table contains no such third index value, rearranging and recombining in sequence the positions of the values within the index value corresponding to the first query unit, and determining whether an index value identical to the rearranged index value corresponding to the first query unit is found in the index table; and, if not, proceeding to the step of determining whether the data before the start position of the sliding window reaches the size of the first query unit.
  • An embodiment of the present invention provides a sixth possible implementation: if a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree is found in the index table, the data in the first query unit is searched for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.
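
The approximate lookup of the fifth and sixth implementations might look like the sketch below. The text does not define how the matching degree is computed or exactly how the values are rearranged, so everything here is an assumed reading: the matching degree is taken as the fraction of per-block bit groups that agree position by position, the rearrangement is taken as trying permutations of those groups, and `GROUP_BITS`, `GROUPS`, the threshold of 0.75, and all function names are hypothetical.

```python
from itertools import permutations

GROUP_BITS = 5   # assumed: bits contributed by each minimum data block
GROUPS = 8       # assumed: minimum data blocks per first query unit

def unpack(index: int) -> list:
    """Split an index value into its per-block bit groups, most significant first."""
    mask = (1 << GROUP_BITS) - 1
    return [(index >> (GROUP_BITS * i)) & mask for i in reversed(range(GROUPS))]

def pack(groups) -> int:
    """Concatenate per-block bit groups back into one index value."""
    value = 0
    for g in groups:
        value = (value << GROUP_BITS) | g
    return value

def matching_degree(a: int, b: int) -> float:
    # Assumption: share of groups that are equal at the same position.
    return sum(x == y for x, y in zip(unpack(a), unpack(b))) / GROUPS

def fuzzy_lookup(index: int, table: dict, threshold: float = 0.75):
    """Return a matching index value from the table, or None."""
    # First pass: any stored index value that is similar enough.
    for candidate in table:
        if matching_degree(index, candidate) >= threshold:
            return candidate
    # Second pass: rearranged versions of the query index value.
    for perm in permutations(unpack(index)):
        value = pack(perm)
        if value in table:
            return value
    return None
```

The linear scan and the exhaustive permutation loop are chosen for clarity only; a practical index table would need a cheaper way to find near matches.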
  • An embodiment of the present invention further provides a seventh possible implementation, in which the method further includes:
  • An embodiment of the present invention provides a data processing apparatus, including: an index construction unit, configured to perform index construction, where the index construction includes: using the data covered by a sliding window, within the data that requires a duplicate data query, as a first query unit; extracting some bits from the fingerprint value of each minimum data block in the first query unit; and combining the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks and a minimum data block is the data block of the minimum query unit for a duplicate data search;
  • an index matching unit, configured to query a preset index table for an index value identical to the index value corresponding to the first query unit; and
  • a duplicate data search unit, configured to: if the index matching unit finds in the index table a first index value identical to the index value corresponding to the first query unit, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
  • An embodiment of the present invention provides a first implementation: if the duplicate data search unit finds that the data in the first query unit duplicates the target data pointed to by the data storage address corresponding to the first index value, the index construction unit is further configured to use the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window, and to construct an index value of the preset length from the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block;
  • the index matching unit is further configured to query the index table for a second index value identical to the index value corresponding to the second query unit; and
  • the duplicate data search unit is further configured to: if the index matching unit finds in the index table a second index value identical to the index value corresponding to the second query unit, search the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
  • An embodiment of the present invention provides a second possible implementation: if the index matching unit finds in the index table no second index value identical to the index value corresponding to the second query unit, the apparatus further includes: a first storage unit, configured to store the data in the second query unit; and
  • a first index table updating unit, configured to insert into the preset index table the correspondence between the index value corresponding to the data of the second query unit and the storage address of the data in the second query unit.
  • The data processing apparatus of the embodiment of the present invention further provides a third possible implementation: if the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the apparatus further includes: a first determining unit, configured to determine, in that case, whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window; and a first instruction unit, configured to instruct the sliding window to slide by a preset step size when the first determining unit determines that the size of the data before the start position of the sliding window does not reach the size of the first query unit;
  • the index construction unit is further configured to perform the index construction using the data in the sliding window after sliding as the first query unit.
  • The data processing apparatus of the embodiment of the present invention provides a fourth possible implementation: if the index matching unit finds in the index table no first index value identical to the index value corresponding to the first query unit, the apparatus further includes: a second determining unit, configured to determine, when the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window;
  • a second instruction unit, configured to instruct the sliding window to slide by a preset step size when the second determining unit determines that the size of the data before the start position of the sliding window does not reach the size of the first query unit; and
  • the index construction unit is further configured to perform the index construction using the data in the sliding window after sliding as the first query unit.
  • An embodiment of the present invention further provides a fifth possible implementation: if the index matching unit finds in the index table no first index value identical to the index value corresponding to the first query unit, the index matching unit is further configured to query the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than a preset matching degree; and, if no such third index value is found, to rearrange and recombine in sequence the positions of the values within the index value corresponding to the first query unit and determine whether an identical third index value is found in the index table for the rearranged index value;
  • the second determining unit is further configured to: when the index matching unit ultimately fails to match a third index value, determine whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding direction of the sliding window, and send the result to the second instruction unit.
  • The duplicate data search unit is further configured to: if the index matching unit finds in the index table a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.
  • An embodiment of the present invention further provides a seventh possible implementation, in which the apparatus further includes: a delta compression unit, configured to obtain the duplicate data search result of the duplicate data search unit and, if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, perform delta compression on the data in the first query unit based on the target data pointed to by the data storage address corresponding to the first index value; a second storage unit, configured to store the data after the delta compression is completed; and a second index table updating unit, configured to insert into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.
  • The index is built from the index value corresponding to the data of the first query unit; the first query unit includes a plurality of minimum data blocks, and the index value is formed by taking some bits from each minimum data block. As a result, the index matching time is greatly reduced, index matching efficiency is improved, and the memory occupied by the index is also greatly reduced.
  • FIG. 1 is a schematic structural diagram of a data processing system according to an embodiment of the present invention
  • FIG. 1 is a schematic diagram of another structure of a data processing system according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a data processing method according to an embodiment of the present invention
  • FIG. 2A is a schematic diagram of an index structure provided by an embodiment of the present invention.
  • FIG. 3 is a flowchart of another data processing method according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure.
  • FIG. 6 is a schematic structural diagram of a deduplication processor according to an embodiment of the present invention.

Detailed description

  • An embodiment of the present invention provides a data processing system that includes at least one deduplication processor and at least one storage node. The deduplication processor and the storage node may be deployed as follows.
  • Mode 1: each deduplication processor is connected to the storage node through a network; the deduplication processor can be deployed as software, as a separate hardware device, or integrated into another hardware device, and is deployed on the user side;
  • the deduplication processor may be integrated on the storage node as a hardware device, or deployed on the storage node as a software function module, and performs processing after receiving the data sent by the user.
  • A data processing system 10 provided in an embodiment of the present invention includes at least one deduplication processor 101, 102, 10n and a plurality of storage nodes 111, 112, 11n. Each deduplication processor receives the data sent by a user through an interface; the interface may be a standard protocol interface, such as an NFS protocol interface.
  • An index table is preset in the data processing system. The index table contains the correspondence between the index values corresponding to data already stored in the storage nodes and the data storage addresses, and it can be stored in each deduplication processor or in a storage node.
  • Each deduplication processor is connected to each storage node in the data processing system, for example through a network connection or another type of connection.
  • FIG. 2 is a flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 2, the method is applied to a data processing system and includes:
  • Step 201: The deduplication processor uses the data in the sliding window, within the data that requires a duplicate data query, as a rated-size data block. The rated-size data block includes a plurality of minimum data blocks. A minimum data block is the data block of the minimum query unit in a duplicate data search and is usually 4 KB; with variable-length blocking, the minimum query unit is about 4 KB. The embodiment of the present invention does not limit the size of the data block of the minimum query unit. For convenience of description, the data in the sliding window is referred to as the first query unit;
  • Step 202: Perform index construction, including: extracting some bits from the fingerprint value of each minimum data block in the first query unit, and combining the extracted bits into an index value of a preset length corresponding to the first query unit;
  • Specifically, the fingerprint value of each minimum data block in the first query unit is obtained, the same preset number of bits is extracted from the fingerprint value of each minimum data block, and all the extracted bits are combined into the index value corresponding to the first query unit. As shown in FIG. 2A, the rated-size data block is 32 KB and includes eight 4 KB minimum data blocks; 5 bits are taken from the fingerprint value of each minimum data block, and the extracted bits are combined into a 40-bit index value.
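
The index construction of step 202 can be illustrated with a short sketch using the numbers from the example (32 KB unit, eight 4 KB blocks, 5 bits each, 40-bit index). The text does not say which fingerprint algorithm is used or which bits are extracted, so SHA-1 and "the leading bits" are assumptions made here, and the function names are hypothetical.

```python
import hashlib

MIN_BLOCK = 4 * 1024   # minimum data block size from the example (4 KB)

def fingerprint(block: bytes) -> bytes:
    # Assumption: SHA-1 stands in for the unspecified fingerprint algorithm.
    return hashlib.sha1(block).digest()

def take_bits(fp: bytes, nbits: int) -> int:
    # Assumption: the extracted bits are the leading nbits of the fingerprint.
    return int.from_bytes(fp, "big") >> (len(fp) * 8 - nbits)

def build_index(unit: bytes, index_bits: int = 40) -> int:
    """Concatenate an equal number of fingerprint bits from each minimum block."""
    blocks = [unit[i:i + MIN_BLOCK] for i in range(0, len(unit), MIN_BLOCK)]
    per_block = index_bits // len(blocks)   # 40 // 8 = 5 bits in the example
    index = 0
    for block in blocks:
        index = (index << per_block) | take_bits(fingerprint(block), per_block)
    return index
```

With one 40-bit value per 32 KB unit instead of one full fingerprint per 4 KB block, the index table holds far fewer and far shorter entries, which is the memory saving the text describes.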
  • Step 203: Query a preset index table for an index value identical to the index value corresponding to the first query unit. If a first index value identical to the index value corresponding to the first query unit is found in the index table, proceed to step 204;
  • For convenience of description, the identical index value matched in the index table for the first query unit is referred to as the first index value.
  • Depending on where the index table is stored, the specific steps that the deduplication processor performs for the index query in step 203 differ:
  • if the index table is stored in the deduplication processor, the deduplication processor may query the local index table for a first index value identical to the index value corresponding to the first query unit and obtain the search result;
  • if the index table is stored in a storage node, the deduplication processor may send the index value corresponding to the first query unit to the storage node; the storage node queries the index table for a first index value identical to the index value corresponding to the first query unit, and the deduplication processor receives the query result fed back by the storage node.
  • Step 204: Search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
  • The search in step 204 may be performed by loading the target data pointed to by the data storage address corresponding to the first index value into the deduplication processor and comparing it with the data included in the first query unit; by sending the data included in the first query unit to the storage node where the target data is located and comparing it with the target data there; or by comparing the fingerprint value corresponding to the target data pointed to by the data storage address corresponding to the first index value with the fingerprint value corresponding to the data included in the first query unit. The specific manner is not limited in the embodiment of the present invention.
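
Steps 203 and 204 amount to a table lookup followed by a verification of the actual data. A minimal sketch, assuming a local in-memory index table and store (both hypothetical stand-ins, as is the use of SHA-1 for the fingerprint comparison):

```python
import hashlib

index_table = {}   # hypothetical: index value -> data storage address
storage = {}       # hypothetical: data storage address -> stored bytes

def find_duplicate(index_value: int, unit: bytes):
    """Return the storage address of matching target data, or None."""
    # Step 203: look for an identical index value in the index table.
    address = index_table.get(index_value)
    if address is None:
        return None
    # Step 204: an index hit is only a hint, so verify against the target data.
    # Assumption: verification compares SHA-1 fingerprints of the two buffers.
    target = storage[address]
    if hashlib.sha1(target).digest() == hashlib.sha1(unit).digest():
        return address
    return None
```

The verification step matters because the index value carries only a few bits per block, so two different units can share the same index value.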
  • Because the rated-size data block includes a plurality of minimum data blocks and a single index is constructed from those minimum data blocks, the number of indexes is greatly reduced.
  • Step 205: Determine whether the size of the data before the start position of the sliding window reaches the size of the rated-size data block, where "before" refers to the direction opposite to the sliding direction of the sliding window; if not, proceed to step 206; if so, proceed to step 207;
  • If part of the data to be queried is determined to be duplicate data, that part needs to be deleted from the data to be queried, and operations such as incrementing the reference count of the already stored data are performed; if part of the data to be queried is determined to be new data, that part is stored. Because part of the data to be queried is deleted or stored in this way, data breakpoints appear in the data to be queried. In the description of this embodiment, "before the start position of the sliding window" refers to the direction opposite to the sliding direction of the sliding window.
  • Step 206: Slide the sliding window by a preset step size, use the data in the sliding window after sliding as the first query unit, and return to step 202;
  • Step 207: Store the data before the start position of the sliding window as new data, where "before" refers to the direction opposite to the sliding direction of the sliding window;
  • In the embodiment of the present invention, the sliding window slides by one step size and the search is then repeated. When the size of the data before the sliding window reaches the rated-size data block, the sliding window has slid by the length of one rated-size data block, and the data of that length is exactly the data covered by an earlier sliding window. Because it was already determined, when the earlier sliding window covered that data, that no identical index exists in the index table for the corresponding index, the data before the sliding window can be stored directly as new data;
  • Step 208: Insert into the index table the correspondence between the index value corresponding to the new data and the storage address of the new data.
  • The new data is data covered by an earlier sliding window, and its index value was already calculated when the sliding window covered it. If those index values were saved, the index value of the new data can be obtained directly; if not, the index value corresponding to the new data can be obtained by the method described above for obtaining the index value corresponding to the first query unit.
  • The index table may be empty at the beginning and is continuously updated during subsequent duplicate data searches by inserting the correspondence between the index value corresponding to new data and the storage address of the new data, where new data is data found to be non-duplicate.
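
The loop formed by steps 202 to 208 can be sketched as below. This is a simplified illustration, not the patented procedure in full: the step size of one minimum block, SHA-1 as the fingerprint, the byte-for-byte comparison in step 204, the list-based `storage`, and the handling of leftover data on a match (stored without an index entry here, whereas the text goes on to treat it as a second query unit) are all assumptions.

```python
import hashlib

MIN_BLOCK = 4096        # minimum data block size (4 KB in the text)
UNIT = 8 * MIN_BLOCK    # rated-size data block (32 KB in the example)
STEP = MIN_BLOCK        # assumption: the preset step is one minimum block

def build_index(unit: bytes, bits: int = 5) -> int:
    # Leading `bits` bits of each block's SHA-1 (assumed fingerprint), concatenated.
    index = 0
    for i in range(0, len(unit), MIN_BLOCK):
        fp = int.from_bytes(hashlib.sha1(unit[i:i + MIN_BLOCK]).digest(), "big")
        index = (index << bits) | (fp >> (160 - bits))
    return index

def dedup(data: bytes, index_table: dict, storage: list) -> int:
    """Process data; return the number of rated-size units found to be duplicates."""
    duplicates = 0
    pending = 0   # start of data not yet stored or deduplicated
    start = 0     # start position of the sliding window
    while start + UNIT <= len(data):
        unit = data[start:start + UNIT]
        address = index_table.get(build_index(unit))            # steps 202-203
        if address is not None and storage[address] == unit:     # step 204
            duplicates += 1
            if start > pending:
                # Simplification: leftover data before the window is stored as is.
                storage.append(data[pending:start])
            pending = start = start + UNIT
            continue
        if start - pending >= UNIT:                              # step 205
            new = data[pending:pending + UNIT]                   # step 207
            storage.append(new)
            index_table[build_index(new)] = len(storage) - 1     # step 208
            pending += UNIT
        start += STEP                                            # step 206
    if pending < len(data):
        storage.append(data[pending:])   # simplification: keep the unprocessed tail
    return duplicates
```

Starting from an empty `index_table`, the first pass over fresh data only inserts entries; later occurrences of a stored 32 KB unit are then found even when they start at an offset that is not aligned to 32 KB, as long as it falls on a step boundary.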
  • In the embodiment of the present invention, one index is constructed from the plurality of minimum data blocks in the sliding window. Therefore, when the data in the sliding window, that is, the data in the first query unit, matches an identical first index value in the index table, the data pointed to by the data address corresponding to the first index value is compared with the data in the first query unit to determine whether the data in the first query unit duplicates already stored data. Before the data itself is compared, whether the data is duplicated may also be judged by comparing the fingerprint values corresponding to the data, as in the prior art.
  • the method may further include:
  • Step 209: Determine whether the fingerprint value of the data in the first query unit is exactly the same as the fingerprint value of the target data; if so, proceed to step 210; if not, proceed to step 205;
  • In the embodiment of the present invention, the sliding window slides by one step size and the search is then repeated. Once the size of the data before the start position of the sliding window reaches the rated size of the data block, the sliding window has slid by the length of one rated-size data block, and the data of that length is exactly the data covered by an earlier sliding window. Because it was already determined, when the earlier sliding window covered that data, that the data does not completely duplicate the target data, the data before the start position of the current sliding window can be stored directly as new data. Therefore, when it is determined that the fingerprint value of the data in the first query unit is not exactly the same as the fingerprint value of the target data, the process may proceed directly to step 205;
  • Step 210: Deduplicate the data in the first query unit.
  • In this case, the data in the first query unit is duplicate data; for the specific method of deleting the duplicate data, refer to the prior art.
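
The text leaves the deletion of duplicate data to the prior art. One common approach, shown here only as an assumed illustration and not as the method of the embodiment, keeps a single stored copy and increments a reference count each time a duplicate is found; the `Store` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Store:
    blocks: dict = field(default_factory=dict)     # storage address -> data
    refcount: dict = field(default_factory=dict)   # storage address -> references

    def put(self, data: bytes) -> int:
        """Store new data and return its storage address."""
        address = len(self.blocks)
        self.blocks[address] = data
        self.refcount[address] = 1
        return address

    def reference(self, address: int) -> int:
        """Record a duplicate: keep the single copy and add one reference."""
        self.refcount[address] += 1
        return address
```

The caller then keeps only the returned address in place of the duplicate data.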
  • the embodiment of the present invention further includes:
  • Step 211 The data before the start position of the sliding window is used as a second query unit, where the second query unit includes at least one minimum data block, in the reverse direction of the sliding window.
  • the at least one minimum data block in the second query unit constructs an index value having the same length as the index value corresponding to the first query unit, and querying in the index table whether the second query unit corresponds to the second query unit. a second index value with the same index value, and if so, proceeds to step 212; if not, proceeds to step 213;
  • the second query unit may include only one minimum data block, for example, 4 KB data. In this case, it is also necessary to construct an index value corresponding to the first query unit according to the minimum data block. Index values of the same length;
  • the second query unit includes a plurality of minimum data blocks, it is required to construct an index value having the same length as the index value corresponding to the first query unit according to all the minimum data blocks included, for example, if the index value The length needs 40 bits, there are two minimum data blocks, then 20 bits are needed to obtain the 40-bit index value from the fingerprint value of each minimum data block; Step 212, find the second query unit Whether there is data in the data that is duplicated by the target data pointed to by the data storage address corresponding to the second index value.
  • The duplicate data in step 212 may be found by loading the target data pointed to by the data storage address corresponding to the second index value into the deduplication processor and comparing it with the data included in the second query unit, or by sending the data included in the second query unit to the data storage address corresponding to the second index value for the comparison; alternatively, the fingerprint value corresponding to the target data pointed to by the data storage address corresponding to the first index value may be compared with the fingerprint value corresponding to the data included in the first query unit to find duplicate data.
  • Step 213: The data in the second query unit is stored as new data, and the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit is inserted into the index table.
  • Because the data is reference counted, compressed, and so on in units of rated-size data blocks, memory occupation is greatly reduced; and in the duplicate data search process, duplicate data is searched for using a mix of rated-size data blocks and data blocks smaller than the rated size, which improves the deduplication rate.
  • an embodiment of the present invention further provides another data processing method.
  • The difference is that when the same index value is not found in the index table, the index value is changed and the search continues, in order to improve the probability of finding a matching index.
  • the data processing method described in FIG. 3 includes:
  • Step 301: The deduplication processor uses the data covered by the sliding window, within the data requiring duplicate data search, as a first query unit, that is, a rated-size data block; the rated-size data block includes a plurality of minimum data blocks, and a minimum data block is the data block of the minimum query unit for duplicate data search.
  • Step 302: Perform index construction, including: constructing, according to the plurality of minimum data blocks in the first query unit, an index value of a preset length for the first query unit;
  • Step 303: Query the index table for a first index value that is the same as the index value corresponding to the first query unit; if such a first index value is found, proceed to step 308; if not, proceed to step 304;
  • Step 304: Query the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than a preset matching degree. If one is found, go to step 308; if not, go to step 306;
  • In actual operation, obtaining the third index value and obtaining the first index value may be completed in one step; the embodiment of the present invention describes them as two logical steps for clarity;
  • Step 306: Determine whether one full cycle of permutations has been exceeded; if not, proceed to step 307; if so, proceed to step 309;
  • Step 307: Rearrange the positional order of the data in the index value corresponding to the first query unit, then return to step 303.
  • The positional order of the data in the index value corresponding to the first query unit can be changed in various ways; the embodiment of the present invention uses permutation.
  • Specifically, the positional order may be changed by dividing the index value corresponding to the first query unit into multiple parts and rearranging the positions of those parts within the index value.
  • Step 308: Find whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value;
  • Step 309: Determine whether the size of the data before the start position of the sliding window reaches the size of the rated-size data block, where "before" refers to the direction opposite to the sliding of the sliding window; if not, proceed to step 310; if yes, proceed to step 311;
  • Step 310: Slide the sliding window by a preset step size, use the data in the sliding window after sliding as a first query unit, and return to step 302;
  • Step 311: The data before the start position of the sliding window is stored as new data.
  • Step 312: Insert the obtained correspondence between the index value corresponding to the new data and the storage address of the new data into the index table.
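The control flow of steps 301 to 312 can be sketched roughly as follows. This is a simplified illustration, not the patented implementation: `index_of` is a stand-in that hashes the whole chunk with SHA-1 instead of taking bits from each minimum data block, the approximate matching and rearrangement of steps 304 to 307 are omitted, data before the window is stored directly rather than being looked up as a second query unit, and all names (`slide_and_dedup`, `store`, `recipe`) are hypothetical.

```python
import hashlib


def index_of(chunk: bytes) -> bytes:
    """Stand-in index value: first 40 bits of SHA-1 over the chunk (assumption)."""
    return hashlib.sha1(chunk).digest()[:5]


def slide_and_dedup(data: bytes, rated_size: int, step: int):
    index_table = {}   # index value -> address (position in `store`)
    store = []         # chunks stored as new data
    recipe = []        # addresses that rebuild `data` in order
    pending_start = 0  # start of the data "before" the window, not yet handled
    pos = 0            # start position of the sliding window

    def store_new(chunk: bytes) -> None:
        # Steps 311-312: store new data and insert its index into the table.
        store.append(chunk)
        index_table[index_of(chunk)] = len(store) - 1
        recipe.append(len(store) - 1)

    while pos + rated_size <= len(data):
        window = data[pos:pos + rated_size]        # first query unit
        addr = index_table.get(index_of(window))   # steps 302-303
        if addr is not None and store[addr] == window:
            # Step 308 found a complete duplicate: keep only a reference.
            if pending_start < pos:
                store_new(data[pending_start:pos])  # simplified: no second lookup
            recipe.append(addr)
            pos += rated_size
            pending_start = pos
        elif pos - pending_start >= rated_size:    # step 309
            store_new(data[pending_start:pos])     # steps 311-312
            pending_start = pos
        else:
            pos += step                            # step 310
    if pending_start < len(data):
        store_new(data[pending_start:])            # tail shorter than one window
    return store, recipe
```

With a step of one byte, a rated-size block that reappears after a two-byte insertion is still found, and only the two inserted bytes are stored as additional new data.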
  • the method may further include after step 308:
  • Step 313: Determine whether the data in the first query unit completely duplicates the data pointed to by the data address corresponding to the found index value; if it does not, proceed to step 314;
  • Whether the data is completely duplicated may be determined by comparing fingerprint values.
  • Step 314: Delta compress the data in the first query unit against the data pointed to by the data address corresponding to the found index value;
  • The specific delta compression algorithm may be, for example, zdelta, vcdiff, or xdelta, which is not limited in the embodiment of the present invention.
  • Step 315: Store the data obtained by the delta compression, and insert the correspondence between the index value corresponding to the first query unit and the storage address of the data obtained by the delta compression into the index table.
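As a rough illustration of steps 313 to 315, the sketch below encodes a near-duplicate block as copy and add operations against the already stored target data. It uses Python's `difflib.SequenceMatcher` purely as a stand-in for a real delta compressor; it is not one of the algorithms named in the text, and the function names and operation format are hypothetical.

```python
from difflib import SequenceMatcher


def delta_encode(target: bytes, new: bytes) -> list:
    """Describe `new` as copies from `target` plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, target, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse target[i1:i2]
        elif j2 > j1:                         # 'replace' or 'insert'
            ops.append(("add", new[j1:j2]))   # carry the new bytes literally
        # 'delete' contributes nothing to the rebuilt data
    return ops


def delta_decode(target: bytes, ops: list) -> bytes:
    """Rebuild the original block from the target data and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += target[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the list of operations would need to be stored, together with the index entry that points at it; the target data itself is already in storage.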
  • the method may further include:
  • Step 316 Perform data deletion on the data in the first query unit.
  • Step 317: The data before the start position of the sliding window is used as a second query unit. An index value of the same length as the index value corresponding to the first query unit is constructed according to the at least one minimum data block in the second query unit, and the index table is queried for a second index value that is the same as the index value corresponding to the second query unit; if yes, go to step 318; if no, go to step 319;
  • Step 318: Find whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
  • The duplicate data in step 318 may be found by loading the target data pointed to by the data storage address corresponding to the second index value into the deduplication processor and comparing it with the data included in the second query unit, or by sending the data included in the second query unit to the data storage address corresponding to the second index value for the comparison; alternatively, the fingerprint value corresponding to the target data pointed to by the data storage address corresponding to the first index value may be compared with the fingerprint value corresponding to the data included in the first query unit to find duplicate data.
  • After the duplicate data search in step 318 is completed, if the data is completely duplicated, the data of the second query unit is deduplicated. If the data is not completely duplicated, delta compression may be performed or the data of the second query unit may be stored directly as new data; the present invention does not limit this. The embodiment of the present invention takes storing incompletely duplicated data as new data as an example.
  • Step 319: The data in the second query unit is stored as new data, and the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit is inserted into the index table.
  • The data processing method provided by the embodiment of the present invention performs index searching on rated-size data blocks that each include a plurality of minimum query units, which greatly reduces memory occupation. Further, by sliding the window in units of the minimum query unit and checking again after each slide, the embodiment avoids the drop in deduplication rate caused by offsets in the data, implements mixed-granularity duplicate data search, and improves the deduplication rate while reducing memory occupation.
  • the embodiment of the present invention further provides a data processing apparatus, which is used to perform the method provided in the foregoing embodiment.
  • the principle and the technical effect of the implementation are similar to the method provided by the embodiment of the present invention.
  • In a specific implementation, the data processing apparatus may be a deduplication processor or any device that performs the same function, such as a storage node on which a deduplication processor is installed. It is applied to a data processing system that includes the data processing apparatus and a storage node, and the data processing apparatus communicates with the storage node;
  • An embodiment of the present invention provides a structure of a data processing apparatus, including: an index construction unit 401, configured to perform index construction, where the index construction includes: using the data covered by the sliding window, within the data requiring duplicate data search, as a first query unit; extracting some bits from the fingerprint value of each minimum data block in the first query unit; and composing the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is the data block of the minimum query unit for duplicate data search.
  • the index matching unit 402 is configured to query, in a preset index table, whether there is an index value that is the same as the index value corresponding to the first query unit, and obtain a matching result;
  • the duplicate data searching unit 403 is configured to: if the index matching unit searches for the first index value that is the same as the index value corresponding to the first query unit in the index table, search for data in the first query unit. Whether there is data that is duplicated by the target data pointed to by the data storage address corresponding to the first index value.
  • Because the first query unit includes a plurality of minimum data blocks and only a partial index value is taken from each minimum data block to construct one index value of a preset length, the number of indexes held in memory is greatly reduced.
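The index construction can be sketched as below, following the 40-bit example given in the method embodiment (two minimum data blocks contribute 20 bits each). The choice of SHA-1 as the fingerprint, the use of the leading bits of each fingerprint, and the names `fingerprint` and `build_index_value` are assumptions made for illustration; the text only requires that some bits be taken from each fingerprint.

```python
import hashlib

MIN_BLOCK_SIZE = 4 * 1024  # 4 KB minimum data block, as in the text's example
INDEX_BITS = 40            # preset index length used in the text's example
FINGERPRINT_BITS = 160     # SHA-1 digest length (SHA-1 itself is an assumption)


def fingerprint(block: bytes) -> int:
    """Fingerprint value of one minimum data block."""
    return int.from_bytes(hashlib.sha1(block).digest(), "big")


def build_index_value(blocks: list, index_bits: int = INDEX_BITS) -> int:
    """Concatenate an equal share of leading fingerprint bits from each block."""
    if not blocks or index_bits % len(blocks) != 0:
        raise ValueError("index_bits must divide evenly among the blocks")
    bits_per_block = index_bits // len(blocks)
    index_value = 0
    for block in blocks:
        top_bits = fingerprint(block) >> (FINGERPRINT_BITS - bits_per_block)
        index_value = (index_value << bits_per_block) | top_bits
    return index_value
```

A query unit with a single minimum data block yields an index value of the same 40-bit length, so index values for query units of different sizes can share one index table.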
  • The index construction unit 401 is further configured to use the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to the sliding of the sliding window, and to construct an index value of the preset length according to the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block; the index matching unit 402 is further configured to query the index table for a second index value that is the same as the index value corresponding to the second query unit;
  • The duplicate data searching unit 403 is further configured to: if the index matching unit 402 finds in the index table a second index value that is the same as the index value corresponding to the second query unit, search whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
  • If the index matching unit 402 does not find in the index table a second index value that is the same as the index value corresponding to the second query unit, the data processing apparatus may further include: the first storage unit 404, configured to store the data in the second query unit;
  • the first index table updating unit 405 is configured to insert a correspondence between an index value corresponding to the data of the second query unit and a storage address of the data in the second query unit into the preset index table.
  • The index table contains both the index values corresponding to first query units and the index values corresponding to second query units, and the data blocks included in the first query unit and the second query unit differ in size.
  • A hybrid set of index values corresponding to multiple data block sizes is thus formed in the index table, which reduces memory occupation and improves the deduplication search rate.
  • If the duplicate data searching unit 403 finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, the device may further include:
  • the first determining unit 406, configured to determine, when the duplicate data searching unit 403 finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding of the sliding window;
  • a first instruction unit 407, configured to slide the sliding window by a preset step size when the first determining unit 406 determines that the size of the data before the start position of the sliding window has not reached the size of the first query unit;
  • the index construction unit 401 is further configured to construct an index by using the data in the sliding window after sliding as a first query unit.
  • The device may further include: a second determining unit 408, configured to determine, when the duplicate data searching unit 403 finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding of the sliding window;
  • the second instruction unit 409 is configured to slide the sliding window by a preset step size when the second determining unit 408 determines that the size of the data before the start position of the sliding window has not reached the size of the first query unit;
  • the index construction unit 401 is further configured to construct an index by using the data in the sliding window after sliding as a first query unit.
  • The data processing apparatus may further include: a second storage unit 410, configured to: if the second determining unit 408 determines that the size of the data before the start position of the sliding window reaches the size of the first query unit, store the data before the start position of the sliding window as new data; the correspondence of the storage addresses is inserted into the index table.
  • In an embodiment of the present invention, one index value is constructed for a plurality of minimum data blocks, and the data is reference counted, compressed, and so on in units of rated-size data blocks, so memory occupation is greatly reduced; and in the duplicate data search process, duplicate data is searched for using a mix of rated-size data blocks and data blocks smaller than the rated size, which improves the deduplication rate.
  • As shown in FIG. 5, an embodiment of the present invention further provides a data processing apparatus, which is an optimized solution based on the apparatus structure provided in FIG. 4. It differs from the apparatus in the embodiment corresponding to FIG. 4 in that, if no first index value that is the same as the index value corresponding to the first query unit is found in the index table, then before it is determined whether the size of the data before the start position of the sliding window reaches the size of the first query unit, the positions of the data in the index value corresponding to the first query unit are rearranged, and the rearranged index values are matched in the index table to improve the matching rate.
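The rearranged lookup can be sketched as below: the index value is split into equal parts, and every ordering of those parts is tried against the index table until one matches or one full cycle of orderings is exhausted. The number of parts, the dictionary-based index table, and the function names are assumptions for illustration.

```python
from itertools import permutations


def split_index(index_value: int, index_bits: int, parts: int) -> list:
    """Split an index value into equally sized parts, most significant first."""
    part_bits = index_bits // parts
    mask = (1 << part_bits) - 1
    return [(index_value >> (part_bits * (parts - 1 - i))) & mask
            for i in range(parts)]


def join_index(pieces, part_bits: int) -> int:
    """Concatenate parts back into a single index value."""
    value = 0
    for piece in pieces:
        value = (value << part_bits) | piece
    return value


def lookup_with_rearrangement(index_value, index_table, index_bits=40, parts=4):
    """Try the original order first, then every other ordering of the parts."""
    part_bits = index_bits // parts
    pieces = split_index(index_value, index_bits, parts)
    for order in permutations(pieces):  # the first tuple is the original order
        candidate = join_index(order, part_bits)
        if candidate in index_table:
            return candidate, index_table[candidate]
    return None  # one full cycle tried without a match
```

A match on a rearranged index value suggests that the same minimum data blocks appear in a different order, so a full data or fingerprint comparison is still needed afterwards.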
  • The data processing apparatus includes: an index construction unit 501, configured to perform index construction, where the index construction includes: using the data covered by the sliding window, within the data requiring duplicate data search, as a first query unit; extracting some bits from the fingerprint value of each minimum data block in the first query unit; and composing the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is the data block of the minimum query unit for duplicate data search;
  • the index matching unit 502 is configured to query a preset index table for an index value that is the same as the index value corresponding to the first query unit, to obtain a matching result; and a duplicate data searching unit 503 is configured to: if the index matching unit finds in the index table a first index value that is the same as the index value corresponding to the first query unit, search whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
  • If the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are completely duplicated, the index construction unit 501 is further configured to use the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to the sliding of the sliding window, and to construct an index value of the preset length according to the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block; the index matching unit 502 is further configured to query the index table for a second index value that is the same as the index value corresponding to the second query unit; and the duplicate data searching unit 503 is further configured to: if the index matching unit 502 finds in the index table a second index value that is the same as the index value corresponding to the second query unit, search whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
  • If the index matching unit 502 does not find in the index table a second index value that is the same as the index value corresponding to the second query unit, the data processing apparatus may further include: a first storage unit 504, configured to store the data in the second query unit; and a first index table update unit 505, configured to insert the correspondence between the index value corresponding to the data of the second query unit and the storage address of the data in the second query unit into the preset index table.
  • If the index matching unit 502 does not find a first index value that is the same as the index value corresponding to the first query unit, the index matching unit 502 may be further configured to: query the index table for a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree; and, if the index table contains no such third index value, rearrange the positional order of the data in the index value corresponding to the first query unit and match the rearranged index value in the index table.
  • The second determining unit 508 is configured to determine, when the index matching unit ultimately fails to match the third index value, whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding of the sliding window, and to send the result to the second instruction unit 509; the second instruction unit 509 is configured to slide the sliding window by a preset step size when the second determining unit 508 determines that the size of the data before the start position of the sliding window has not reached the size of the first query unit; the index construction unit 501 is further configured to construct an index by using the data in the sliding window after sliding as a first query unit.
  • the data processing apparatus may further include: a second storage unit 510, configured to: if the second determining unit 508 determines that the data size before the start position of the sliding window reaches the first query unit Size, the data before the start position of the sliding window is stored as new data;
  • the correspondence of the storage addresses is inserted into the index table.
  • The duplicate data searching unit 503 is further configured to: if the index matching unit 502 finds in the index table a third index value whose matching degree with the index value corresponding to the first query unit is equal to or higher than the preset matching degree, search whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the third index value.
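The text does not define how the matching degree between two index values is computed. One plausible reading, shown here only as an assumption, is the fraction of bit positions in which the two index values agree; the linear scan over the index table, the threshold of 0.9, and the function names are likewise illustrative.

```python
def matching_degree(a: int, b: int, index_bits: int = 40) -> float:
    """Fraction of bit positions in which two index values agree (assumption)."""
    differing = bin((a ^ b) & ((1 << index_bits) - 1)).count("1")
    return 1.0 - differing / index_bits


def find_similar_index(index_value, index_table, min_degree=0.9, index_bits=40):
    """Return the best index value at or above the preset matching degree."""
    best = None
    best_degree = min_degree
    for candidate in index_table:
        degree = matching_degree(index_value, candidate, index_bits)
        if degree >= best_degree:
            best, best_degree = candidate, degree
    return best  # None when no candidate reaches the preset matching degree
```

A production index would need a structure that avoids scanning every entry; the scan here only keeps the sketch short.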
  • If the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, the data processing apparatus may further include:
  • a delta compression unit 506, configured to acquire the search result of the duplicate data searching unit 503 and, if the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, perform delta compression on the data in the first query unit relative to the target data pointed to by the data storage address corresponding to the first index value; the second storage unit 510 is further configured to store the data obtained by the delta compression;
  • the second index table updating unit 511 is further configured to insert the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data into the index table.
  • The data processing apparatus provided by the embodiment of the present invention performs index searching on rated-size data blocks that each include a plurality of minimum query units, which greatly reduces memory occupation.
  • Further, by sliding the window in units of the minimum query unit and checking again after each slide, the embodiment avoids the drop in deduplication rate caused by offsets in the data, implements mixed-granularity duplicate data search, and improves the deduplication rate while reducing memory occupation; and when index matching is performed, rearranging the data in the index value corresponding to the first query unit improves the duplicate data search rate.
  • an embodiment of the present invention further provides a deduplication processor 600, including a processor 61, a memory 62, a communication interface 63, and a communication bus 64.
  • the processor 61, the communication interface 63, and the memory 62 communicate with each other through the communication bus 64; the communication interface is configured to receive and transmit data;
  • the memory 62 is for storing a program; the memory 62 may include a high speed RAM memory, and may also include a non-volatile memory such as at least one disk memory;
  • the processor 61 is configured to execute the program in the memory to execute a data processing method as provided by the foregoing method embodiments.
  • If the functions are implemented in the form of software functional units and sold or used as standalone products, they may be stored in a computer-readable storage medium.
  • The technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions used to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.

Abstract

Embodiments of the present invention, by indexing data primarily according to index values corresponding to data of a first query unit, where the first query unit comprises multiple minimum data blocks, and by extracting some bits from each minimum data block to compose an index value corresponding to the first query unit, greatly reduce index matching time, increase index matching efficiency, and at the same time, make possible a significant reduction in the amount of memory occupied by indexes.

Description

Data processing method, apparatus, and deduplication processor
Technical field
Embodiments of the present invention relate to storage technologies, and in particular, to a data processing method, an apparatus, and a deduplication processor.
Background
Data deduplication, also known as intelligent compression or single-instance storage, is a storage technology that automatically searches for duplicate data, keeps only one unique copy of identical data, and replaces the other duplicate copies with pointers to that single copy, thereby eliminating redundant data and reducing storage capacity requirements.
In the prior art, a deduplication method may use a fixed-length chunking algorithm.
A fingerprint algorithm is used to calculate the fingerprint of the data object within a sliding window. If a predetermined condition is met, the start position and end position of the sliding window are used as the boundaries of a data block, and the data object is divided into blocks by continuously sliding the window and calculating fingerprints. For each data block obtained by the division, it is first determined whether the data block is larger than a lower length limit. If it is, the fingerprint value of the data block, for example a hash value, is calculated and compared with the fingerprint values stored in the storage device. If the fingerprint value of the data block is the same as a fingerprint value stored in the storage device, the data block is a duplicate: an identical data block is already stored in the storage device, so the data object can reference the stored data block. If no identical fingerprint value exists in the storage device, the data block and its fingerprint value are stored in the storage device for subsequent duplicate data determination.
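The fingerprint comparison described above can be illustrated with a minimal fixed-length chunking sketch. SHA-1 as the fingerprint, the 4 KB chunk size, and all names are assumptions for illustration. Note that the fingerprint table holds one entry per stored chunk, which is the source of the memory cost this document sets out to reduce.

```python
import hashlib


def dedup_fixed_chunks(data: bytes, chunk_size: int = 4096):
    """Store each distinct fixed-length chunk once; repeats become references."""
    fingerprints = {}  # fingerprint -> position of the chunk in `stored`
    stored = []        # unique chunks actually kept
    refs = []          # one reference per chunk of the input, in order
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = hashlib.sha1(chunk).digest()
        if fp not in fingerprints:   # no identical fingerprint stored yet
            fingerprints[fp] = len(stored)
            stored.append(chunk)
        refs.append(fingerprints[fp])
    return stored, refs
```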
However, the inventors found that in prior-art deduplication methods the index occupies a large amount of memory, which makes these methods unsuitable for storage requirements in which the amount of data keeps growing.
Summary of the invention
Embodiments of the present invention provide a data processing method, an apparatus, and a deduplication processor, which reduce memory occupation and meet the demands of continuously growing data volumes.
In a first aspect, an embodiment of the present invention provides a data processing method. The method is applied to a data processing system that includes a deduplication processor, and the method includes: the deduplication processor uses the data covered by a sliding window, within the data requiring duplicate data search, as a first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is the data block of the minimum query unit for duplicate data search; and index construction and duplicate data search are performed on the data in the first query unit.
The index construction includes: extracting some bits from the fingerprint value of each minimum data block in the first query unit, and composing the extracted bits into an index value of a preset length corresponding to the first query unit;
The duplicate data search includes: querying a preset index table for an index value that is the same as the index value corresponding to the first query unit, and, if a first index value that is the same as the index value corresponding to the first query unit is found in the index table, searching whether the data in the first query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation, in which the method further includes: if the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are completely duplicated, using the data before the start position of the sliding window as a second query unit, where "before" refers to the direction opposite to the sliding of the sliding window, and the second query unit includes at least one minimum data block; constructing an index value of the preset length according to the at least one minimum data block in the second query unit; and querying the index table for a second index value that is the same as the index value corresponding to the second query unit;
if a second index value that is the same as the index value corresponding to the second query unit is found in the index table, searching whether the data in the second query unit contains data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
With reference to the first possible implementation of the first aspect, an embodiment of the present invention further provides a second possible implementation, in which the method further includes:
if no second index value that is the same as the index value corresponding to the second query unit is found in the index table, inserting the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit into the index table.
With reference to the first aspect, an embodiment of the present invention provides a third possible implementation: if the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, the method further includes: determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" refers to the direction opposite to the sliding of the sliding window; and if not, sliding the sliding window by a preset step size, using the data in the sliding window after sliding as a first query unit, and performing the index construction step and the duplicate data search step.
With reference to the first aspect, the first possible implementation of the first aspect, or the second possible implementation of the first aspect, an embodiment of the present invention further provides a fourth possible implementation: if no first index value that is the same as the index value corresponding to the first query unit is found in the index table, the method further includes:
判断所述滑动窗口起始位置之前的数据大小是否达到所述第一查询单位 的大小, 所述之前是针对所述滑动窗口滑动的反方向而言, 如果否, 则以预设 的步长滑动所述滑动窗口, 将滑动后所述滑动窗口内的数据作为第一查询单 位, 执行所述构造索引的步骤和所述重复数据查找的步骤。 结合第一方面的第四种可能方式, 本发明实施例还提供了第五种可能方 式,若在所述索引表中没有查询到与所述第一查询单位对应索引值相同的第一 索引值,在判断所述滑动窗口起始位置之前的数据的大小是否达到所述第一查 询单位的大小之前, 该方法还包括: 在所述索引表中查询与所述第一查询单位对应索引值匹配度等于或高于 预设的匹配度的第三索引值,若所述索引表中没有与所述第一查询单位对应索 引值匹配度等于或高于预设的匹配度的第三索引值,则将所述第一查询单位对 应的索引值中的数值的位置顺序进行排列组合,判断排列组合后的所述第一查 询单位对应的索引值在所述索引表中是否查找到所述第三索引值,如果没有找 到,则进入所述判断所述滑动窗口起始位置之前的数据是否达到所述第一查询 单位的大小的步骤。  Determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where the previous direction is for the opposite direction of sliding of the sliding window, and if not, sliding at a preset step size The sliding window performs the step of constructing the index and the step of searching for the repeated data by using the data in the sliding window after sliding as the first query unit. With reference to the fourth possible manner of the first aspect, the embodiment of the present invention further provides a fifth possible manner, if the first index value that is the same as the index value corresponding to the first query unit is not queried in the index table. Before determining whether the size of the data before the start position of the sliding window reaches the size of the first query unit, the method further includes: matching, in the index table, the index value corresponding to the first query unit a third index value equal to or higher than a preset matching degree, if the index table does not have a third index value that matches the index value of the first query unit equal to or higher than a preset matching degree, Then, the positional values of the numerical values in the index values corresponding to the first query unit are sequentially arranged and combined, and it is determined whether the index value corresponding to the first query unit after the array combination is found in the index table. 
The index value, if not found, enters the step of determining whether the data before the start position of the sliding window reaches the size of the first query unit.
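The approximate lookup of the fifth possible implementation can be illustrated with a minimal sketch. The text does not define how the degree of match is computed, so the sketch assumes it is the fraction of positions at which two index values agree, and it models an index value as a tuple holding one sampled value per minimum data block. The names `matching_degree` and `find_third_index` are hypothetical.

```python
from itertools import permutations
from typing import Optional, Sequence, Tuple

Fields = Tuple[int, ...]  # assumption: one sampled value per minimum data block


def matching_degree(a: Fields, b: Fields) -> float:
    """Fraction of positions at which two index values agree (assumed metric)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_third_index(query: Fields, table: Sequence[Fields],
                     threshold: float) -> Optional[Fields]:
    """Return a stored index value that matches `query` well enough, or None."""
    # First pass: compare the index value exactly as it was constructed.
    for candidate in table:
        if matching_degree(query, candidate) >= threshold:
            return candidate
    # Second pass: rearrange the positional order of the values and retry.
    for permuted in permutations(query):
        for candidate in table:
            if matching_degree(permuted, candidate) >= threshold:
                return candidate
    # Still nothing: the caller falls through to the window-size check.
    return None
```

Trying every permutation grows factorially with the number of values, so a practical design would presumably restrict the rearrangements it tries; the exhaustive loop is kept here only for clarity.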
With reference to the fifth possible implementation of the first aspect, an embodiment of the present invention provides a sixth possible implementation, in which, if a third index value whose degree of match with the first index value corresponding to the first query unit is equal to or higher than the preset degree of match is found in the index table, the data in the first query unit is searched for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.
With reference to the sixth possible implementation of the first aspect, an embodiment of the present invention further provides a seventh possible implementation, which further includes:
obtaining a duplicate data search result; and if the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, performing delta compression on the data in the first query unit relative to that target data, storing the delta-compressed data, and inserting into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.
According to a second aspect, an embodiment of the present invention provides a data processing apparatus, including:
an index construction unit, configured to perform index construction, where the index construction includes: using, as a first query unit, the data covered by a sliding window in the data on which a duplicate data query is to be performed; extracting some bits from the fingerprint value of each minimum data block in the first query unit; and combining the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes multiple minimum data blocks, and a minimum data block is a data block of the minimum query unit used for duplicate data search;
an index matching unit, configured to query a preset index table for an index value identical to the index value corresponding to the first query unit;
a duplicate data search unit, configured to: if the index matching unit finds in the index table a first index value identical to the index value corresponding to the first query unit, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation, in which, if the duplicate data search unit determines that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are completely duplicated:
the index construction unit is further configured to use the data before the start position of the sliding window as a second query unit, where "before" is relative to the direction opposite to that in which the sliding window slides, and to construct an index value of the preset length from the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block;
the index matching unit is further configured to query the index table for a second index value identical to the index value corresponding to the second query unit; and
the duplicate data search unit is further configured to: if the index matching unit finds in the index table a second index value identical to the index value corresponding to the second query unit, search the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
With reference to the first possible implementation of the second aspect, an embodiment of the present invention provides a second possible implementation, in which, if the index matching unit finds in the index table no second index value identical to the index value corresponding to the second query unit, the apparatus further includes:
a first storage unit, configured to store the data in the second query unit; and
a first index table update unit, configured to insert into the preset index table the correspondence between the index value corresponding to the data of the second query unit and the storage address of the data in the second query unit.
With reference to the second aspect, the data processing apparatus of an embodiment of the present invention further provides a third possible implementation, in which, if the duplicate data search unit finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, the apparatus further includes:
a first judging unit, configured to: when the duplicate data search unit finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, determine whether the data before the start position of the sliding window reaches the size of the first query unit, where "before" is relative to the direction opposite to that in which the sliding window slides;
a first instruction unit, configured to slide the sliding window by a preset step when the first judging unit determines that the data before the start position of the sliding window has not reached the size of the first query unit;
the index construction unit is further configured to use the data in the sliding window after sliding as a first query unit and to construct an index.
With reference to the first aspect, or the first or second possible implementation of the first aspect, the data processing apparatus of an embodiment of the present invention provides a fourth possible implementation, in which, if the index matching unit finds in the index table no first index value identical to the index value corresponding to the first query unit, the apparatus further includes:
a second judging unit, configured to: when the duplicate data search unit finds that the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, determine whether the data before the start position of the sliding window reaches the size of the first query unit, where "before" is relative to the direction opposite to that in which the sliding window slides;
a second instruction unit, configured to slide the sliding window by a preset step when the second judging unit determines that the size of the data before the start position of the sliding window has not reached the size of the first query unit; and
the index construction unit is further configured to use the data in the sliding window after sliding as a first query unit and to construct an index.
With reference to the fourth possible implementation of the second aspect, an embodiment of the present invention further provides a fifth possible implementation, in which the index matching unit is further configured to: if no first index value identical to the index value corresponding to the first query unit is found in the index table, query the index table for a third index value whose degree of match with the index value corresponding to the first query unit is equal to or higher than a preset degree of match; and if the index table contains no such third index value, permute the positional order of the values within the index value corresponding to the first query unit, and determine whether the permuted index value corresponding to the first query unit locates an identical third index value in the index table; and
the second judging unit is further configured to: when the index matching unit ultimately fails to match the third index value, determine whether the size of the data before the start position of the sliding window reaches the size of the first query unit, where "before" is relative to the direction opposite to that in which the sliding window slides, and send the result to the second instruction unit.
With reference to the fifth possible implementation of the second aspect, an embodiment of the present invention provides a sixth possible implementation, in which the duplicate data search unit is further configured to: if the index matching unit finds in the index table a third index value whose degree of match with the first index value corresponding to the first query unit is equal to or higher than the preset degree of match, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.
With reference to the sixth possible implementation of the second aspect, an embodiment of the present invention further provides a seventh possible implementation, which further includes:
a delta compression unit, configured to obtain the duplicate data search result of the duplicate data search unit and, if the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely duplicated, perform delta compression on the data in the first query unit relative to that target data;
the second storage unit is further configured to store the delta-compressed data; and
the second index table update unit is further configured to insert into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.
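The delta compression named in the seventh possible implementation is not specified further in the text. As a minimal sketch, and only as an assumption, the delta can be modelled as a list of "copy from the target data" and "add literal bytes" operations derived with Python's standard `difflib.SequenceMatcher`. The function names `delta_encode` and `delta_decode` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Op = Tuple  # ("copy", start, end) or ("add", literal_bytes)


def delta_encode(target: bytes, unit: bytes) -> List[Op]:
    """Describe `unit` as a sequence of edits relative to stored `target`."""
    ops: List[Op] = []
    matcher = SequenceMatcher(None, target, unit, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse bytes already stored
        elif tag in ("replace", "insert"):
            ops.append(("add", unit[j1:j2]))  # bytes that exist only in unit
        # "delete": bytes present only in target, nothing to emit
    return ops


def delta_decode(target: bytes, ops: List[Op]) -> bytes:
    """Rebuild the original unit from the target data and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += target[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the delta would be stored, and the index table entry for the first query unit would point at the stored delta. `SequenceMatcher` is slow on large inputs, so a real system would likely use a dedicated delta encoder.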
In the embodiments of the present invention, data is indexed mainly by the index value corresponding to the data of a first query unit. The first query unit includes multiple minimum data blocks, and the index value corresponding to the first query unit is formed by taking some bits from each minimum data block. This greatly shortens index matching time and improves index matching efficiency, and it also makes it possible to greatly reduce the memory occupied by the index.
BRIEF DESCRIPTION OF DRAWINGS
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1-A is a schematic structural diagram of a data processing system according to an embodiment of the present invention;
FIG. 1-B is another schematic structural diagram of a data processing system according to an embodiment of the present invention;
FIG. 2 is a flowchart of a data processing method according to an embodiment of the present invention;
FIG. 2-A is a schematic diagram of index construction according to an embodiment of the present invention;
FIG. 3 is a flowchart of another data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention; and
FIG. 6 is a schematic structural diagram of a deduplication processor according to an embodiment of the present invention.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the following clearly and completely describes the technical solutions in the embodiments with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a data processing system that includes at least one deduplication processor and at least one storage node. The embodiment provides two ways of deploying the deduplication processor and the storage node. In mode 1, shown in FIG. 1-A, each deduplication processor is connected to the storage nodes through a network; the deduplication processor may be software, an independent hardware device, or integrated into another hardware device, and is deployed on the user side.
Alternatively, in mode 2, shown in FIG. 1-B, the deduplication processor may be integrated into a storage node as a hardware apparatus, or deployed on a storage node as a software function module, and processes the data after receiving it from the user. A data processing system 10 provided in this embodiment includes at least one deduplication processor 101, 102, ..., 10n and multiple storage nodes 111, 112, ..., 11n. Each deduplication processor receives the data sent by the user through an interface, which may be a standard protocol interface such as an NFS protocol interface.
In this embodiment, an index table is preset in the data processing system. The index table contains the correspondence between the index values of data already stored in the storage nodes and the storage addresses of that data. The index table may be stored in each deduplication processor or in a storage node.
In the data processing system, each deduplication processor is connected to each storage node, for example through a network connection or another type of connection.
FIG. 2 is a flowchart of a data processing method according to an embodiment of the present invention. As shown in FIG. 2, the method of this embodiment is applied to the data processing system and includes:
Step 201: The deduplication processor uses the data within a sliding window, in the data on which a duplicate data query is to be performed, as one rated-size data block. The rated-size data block includes multiple minimum data blocks, and a minimum data block is a data block of the minimum query unit used for duplicate data search. The rated-size data block in the sliding window is used as a first query unit.
In practice, the minimum data block is the data block of the minimum query unit in duplicate data search and is usually 4 KB; with variable-length chunking, the minimum query unit is about 4 KB. This embodiment does not limit the size of the minimum data block. For ease of description, the data in the sliding window is referred to as the first query unit.
Step 202: Perform index construction, which includes extracting some bits from the fingerprint value of each minimum data block in the first query unit and combining the extracted bits into an index value of a preset length corresponding to the first query unit.
In a specific index construction, this embodiment obtains the fingerprint value of each minimum data block in the first query unit, extracts the same preset number of bits from the fingerprint value of each minimum data block, and combines all the extracted bits into the index value corresponding to the first query unit. In the index construction shown in FIG. 2-A, the rated-size data block is 32 KB and contains eight 4 KB minimum data blocks; 5 bits are taken from the fingerprint value of each minimum data block, and the extracted bits are combined into an index value 40 bits long.
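The 32 KB example can be sketched as follows. The sizes (eight 4 KB blocks, 5 bits each, 40-bit index) come from the description; the fingerprint function (SHA-1), the choice of the low-order 5 bits, and the concatenation order are assumptions made only for illustration, and the name `build_index` is hypothetical.

```python
import hashlib

MIN_BLOCK = 4 * 1024   # 4 KB minimum data block
BLOCKS_PER_UNIT = 8    # a 32 KB first query unit holds 8 minimum blocks
BITS_PER_BLOCK = 5     # bits sampled from each block fingerprint


def fingerprint(block: bytes) -> int:
    """Fingerprint of one minimum data block (SHA-1 is an assumption)."""
    return int.from_bytes(hashlib.sha1(block).digest(), "big")


def build_index(unit: bytes) -> int:
    """Combine 5 sampled bits per minimum block into one 40-bit index value."""
    if len(unit) != MIN_BLOCK * BLOCKS_PER_UNIT:
        raise ValueError("expected exactly one rated-size data block")
    index = 0
    for i in range(BLOCKS_PER_UNIT):
        block = unit[i * MIN_BLOCK:(i + 1) * MIN_BLOCK]
        # Assumption: sample the low-order 5 bits of the fingerprint.
        sampled = fingerprint(block) & ((1 << BITS_PER_BLOCK) - 1)
        index = (index << BITS_PER_BLOCK) | sampled
    return index
```

One 40-bit value then stands in for eight per-block index entries, which is where the reduction in index count comes from.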
This embodiment does not limit whether the same number of bits is extracted from each minimum data block, or how the extracted bits are combined into the index; both can be decided flexibly according to the configured index length. The index length is set according to actual conditions and is likewise not limited in this embodiment.
Step 203: Query the preset index table for an index value identical to the index value corresponding to the first query unit. If a first index value identical to the index value corresponding to the first query unit is found in the index table, go to step 204.
For ease of later description, the identical index value that the first query unit matches in the index table is called the first index value.
The specific procedure by which the deduplication processor performs the index query in step 203 depends on where the index table is stored:
When the preset index table is stored in the deduplication processor, the deduplication processor may query its local index table for a first index value identical to the index value corresponding to the first query unit, and obtain the query result.
When the preset index table is stored in a storage node, the deduplication processor may send the index value corresponding to the first query unit to the storage node; the storage node queries the index table for a first index value identical to the index value corresponding to the first query unit, and the deduplication processor receives the query result returned by the storage node.
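The two lookup paths of step 203 can be sketched with a single function. The index table is modelled here as a plain dictionary from index value to storage address, and the remote path as a callable standing in for the request sent to a storage node. Both are assumptions made for illustration, and the name `query_index` is hypothetical.

```python
from typing import Callable, Dict, Optional

Address = int  # assumption: a storage address modelled as an integer


def query_index(index_value: int,
                local_table: Optional[Dict[int, Address]] = None,
                remote_query: Optional[Callable[[int], Optional[Address]]] = None,
                ) -> Optional[Address]:
    """Return the storage address for index_value, or None on a miss."""
    if local_table is not None:
        # Index table kept in the deduplication processor: look up locally.
        return local_table.get(index_value)
    if remote_query is not None:
        # Index table kept on a storage node: send the index value over
        # and use the result the node reports back.
        return remote_query(index_value)
    raise ValueError("no index table configured")
```

A non-`None` result corresponds to finding the first index value and leads to step 204.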
Step 204: Search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
The search in step 204 may be performed in any of the following ways: loading the target data pointed to by the data storage address corresponding to the first index value into the deduplication processor and searching for duplicate data against the data in the first query unit; sending the data in the first query unit to the storage node where the target data is located and searching for duplicate data against the target data there; or comparing the fingerprint values of the target data pointed to by the data storage address corresponding to the first index value with the fingerprint values of the data in the first query unit. This embodiment does not limit the specific way.
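The fingerprint-comparison variant of step 204 can be sketched as a block-by-block comparison. The sketch assumes fixed 4 KB minimum data blocks, SHA-1 fingerprints, and a position-by-position comparison; none of these is prescribed by the text, and the function names are hypothetical.

```python
import hashlib
from typing import List

MIN_BLOCK = 4 * 1024  # assumed fixed-size minimum data block


def block_fingerprints(data: bytes) -> List[bytes]:
    """One fingerprint per minimum data block (SHA-1 is an assumption)."""
    return [hashlib.sha1(data[i:i + MIN_BLOCK]).digest()
            for i in range(0, len(data), MIN_BLOCK)]


def duplicate_flags(unit: bytes, target: bytes) -> List[bool]:
    """For each block of `unit`, whether the block at the same position
    in `target` has the same fingerprint."""
    unit_fps = block_fingerprints(unit)
    target_fps = block_fingerprints(target)
    flags = [u == t for u, t in zip(unit_fps, target_fps)]
    # Blocks of `unit` with no counterpart in `target` cannot be duplicates.
    flags.extend([False] * (len(unit_fps) - len(flags)))
    return flags


def fully_duplicated(unit: bytes, target: bytes) -> bool:
    """True when every block of `unit` duplicates the target data."""
    return len(unit) == len(target) and all(duplicate_flags(unit, target))
```

Because the 40-bit index samples only a few bits per block, an index hit does not by itself prove the data is duplicated, which is why this comparison follows the lookup.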
In this embodiment, a rated-size data block includes multiple minimum data blocks, and one index is constructed for the multiple minimum data blocks, which greatly reduces the number of indexes.
To reduce the effect of minimum-data-block offset, in this method embodiment, if no first index value identical to the index value corresponding to the first query unit is found in the index table in step 203, go to step 205.
Step 205: Determine whether the size of the data before the start position of the sliding window reaches the size of the rated-size data block, where "before" is relative to the direction opposite to that in which the sliding window slides. If not, go to step 206; if so, go to step 207.
In this method embodiment, after each duplicate search, if part of the queried data duplicates already-stored data, that part of the queried data needs to be deleted, and operations such as incrementing the reference count of the stored data are performed; if part of the queried data is determined to be new data, that part is stored. Because part of the queried data is deleted or stored, data breakpoints appear in the queried data. Wherever this embodiment refers to data before the start position of the sliding window, "before" is relative to the direction opposite to that in which the sliding window slides.
Step 206: Slide the sliding window by the preset step, use the data in the sliding window after sliding as a first query unit, and return to step 202.
Step 207: Store the data before the start position of the sliding window as new data, where "before" is relative to the direction opposite to that in which the sliding window slides.
When the rated-size data block in the sliding window finds no identical index in the index table, that is, when no duplicate of the data in the window is found, the stored data may merely be offset by some length relative to the data in the window. This embodiment therefore slides the window by one step and searches again. Once the size of the data before the window reaches that of a rated-size data block, the window has slid the length of one rated-size data block, and the data of that length is exactly the data the window covered earlier. Because the index of that data was already found to have no identical index in the index table when the window covered it, the data before the window can be stored directly as new data. Here, "before" is relative to the direction opposite to that in which the sliding window slides.
Step 208: Insert into the index table the correspondence between the index value of the new data and the storage address of the new data.
As described above, the new data is the data previously covered by the sliding window, and its index value was already computed when the window covered it. If that index value was saved, it can be obtained directly; if not, the index of the new data can be obtained by the same method used to obtain the index of the first query unit. The index table may initially be empty; during subsequent duplicate data search it is continually updated by inserting the correspondence between the index value of new data and its storage address, where new data is data found not to be duplicated.
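The miss path through steps 203 and 205 to 208 can be sketched as a loop. The sketch makes several assumptions that the text does not fix: storage is a single `bytearray` whose offsets serve as storage addresses, the step divides the unit size evenly, `index_of` is any caller-supplied index function, and the window slides again after new data is stored. On a hit, step 204 and the handling of the data left before the window are out of scope and are only recorded as an event. The name `scan` is hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Tuple


def scan(data: bytes,
         index_table: Dict[Hashable, int],
         storage: bytearray,
         index_of: Callable[[bytes], Hashable],
         unit_size: int,
         step: int) -> List[Tuple[str, int, int]]:
    """Walk `data` with a sliding window; return (event, offset, address)."""
    assert unit_size % step == 0, "sketch assumes the step divides the unit"
    events: List[Tuple[str, int, int]] = []
    pending = 0  # start of the data lying before the window, not yet handled
    pos = 0      # start position of the sliding window
    while pos + unit_size <= len(data):
        idx = index_of(data[pos:pos + unit_size])    # step 202
        if idx in index_table:                       # step 203: hit
            events.append(("hit", pos, index_table[idx]))
            # Step 204 and the data in data[pending:pos] are not handled here.
            pos += unit_size
            pending = pos
            continue
        if pos - pending >= unit_size:               # step 205
            chunk = data[pending:pos]                # step 207: new data
            address = len(storage)
            storage.extend(chunk)
            index_table[index_of(chunk)] = address   # step 208
            events.append(("stored", pending, address))
            pending = pos
        pos += step                                  # step 206
    return events
```

Starting from an empty `index_table` mirrors the remark that the table may be blank at first and fills up as non-duplicate data is found.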
本发明实施例中,根据滑动窗口内的多个最小数据块构造一个索引,因此, 当滑动窗口内的数据,也就是第一查询单位内的数据在索引表中匹配到相同的 第一索引值时,需要将所述第一索引值对应的数据地址指向的数据和第一查询 单位内的数据进行比较,从而确定第一查询单位内的数据是否和已经存储的数 据重复。数据之前的比较, 可以采用现有技术中通过数据对应的指纹值进行比 较来判断数据是否重复。 本发明实施例中,当第一查询单位中的数据的指纹值和所述目标数据对应 的指纹值之间不能完全重复时,为了避免是因为存储的数据中可能只是与滑动 窗口中的数据有一定的偏移, 因此, 当不能完全重复时, 本发明实施例将滑动 窗口滑动一个步长之后再进行查找, 因此, 本发明实施例中步骤 204之后还可 以包括: In the embodiment of the present invention, an index is constructed according to a plurality of minimum data blocks in the sliding window. Therefore, when the data in the sliding window, that is, the data in the first query unit, matches the same first index value in the index table. When the data pointed to by the data address corresponding to the first index value is compared with the data in the first query unit, it is determined whether the data in the first query unit is duplicated with the already stored data. Before the comparison of the data, it is possible to judge whether the data is repeated by comparing the fingerprint values corresponding to the data in the prior art. In the embodiment of the present invention, when the fingerprint value of the data in the first query unit and the fingerprint value corresponding to the target data cannot be completely overlapped, the reason for avoiding is that the stored data may only be related to the data in the sliding window. A certain offset, and therefore, when it is not completely repetitive, the embodiment of the present invention slides the sliding window by one step and then performs the search. Therefore, after the step 204 in the embodiment of the present invention, the method may further include:
Step 209: Determine whether the fingerprint values of the data in the first query unit are identical to the fingerprint values of the target data. If yes, go to step 210; if no, go to step 205.
If the data in the sliding window does not completely duplicate the target data, the stored data may merely be offset by some length from the data in the sliding window, so this embodiment slides the sliding window by one step and searches again. Once the data before the start position of the sliding window has reached the size of a nominal-size data block, the sliding window is known to have slid by the length of one nominal-size data block. Data of that length is exactly the data the window covered earlier, and it was already determined at that time that this data does not completely duplicate the target data. The data before the current start position of the sliding window can therefore be stored directly as new data. Accordingly, when the fingerprint values of the data in the first query unit are not identical to the fingerprint values of the target data, the method can go directly to step 205.

Step 210: Deduplicate the data in the first query unit.
When the fingerprint values of the data in the first query unit are identical to the fingerprint values of the target data, the data in the first query unit is determined to be duplicate data. For the specific method of deduplicating duplicate data, refer to the prior art.
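The fingerprint comparison in step 209 can be sketched as follows. This is a minimal illustration, not the patented implementation: the use of SHA-1 as the fingerprint function, the 4 KB block size, and the helper names `fingerprints` and `is_full_duplicate` are all assumptions made for the example.

```python
import hashlib

MIN_BLOCK = 4 * 1024  # assumed minimum data block size (4 KB)


def fingerprints(data: bytes, block_size: int = MIN_BLOCK) -> list:
    """Return one fingerprint per minimum data block (SHA-1 is an assumption)."""
    return [
        hashlib.sha1(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def is_full_duplicate(query_unit: bytes, target: bytes) -> bool:
    """Step 209: the unit is a duplicate only if every block fingerprint matches."""
    return fingerprints(query_unit) == fingerprints(target)
```

A single differing minimum data block is enough to make the comparison fail, which is the case in which the method goes to step 205 instead of step 210.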
Note that once the data in the first query unit has been determined to be duplicate data and has been deduplicated, the data before the sliding window must be processed, regardless of whether its size has reached the size of a nominal-size data block. Therefore, after step 209, this embodiment further includes:
Step 211: Take the data before the start position of the sliding window as a second query unit, where "before" is relative to the reverse of the sliding direction of the window and the second query unit includes at least one minimum data block. Construct, from the at least one minimum data block in the second query unit, an index value of the same length as the index value of the first query unit, and query the index table for a second index value identical to the index value of the second query unit. If one is found, go to step 212; if not, go to step 213.
The second query unit may contain only one minimum data block, for example 4 KB of data. In that case an index value of the same length as the index value of the first query unit must still be constructed from that single minimum data block.
If the second query unit contains multiple minimum data blocks, an index value of the same length as the index value of the first query unit must be constructed from all of the minimum data blocks it contains. For example, if the index value must be 40 bits long and there are two minimum data blocks, 20 bits are taken from the fingerprint value of each minimum data block to form the 40-bit index value.

Step 212: Search the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

In step 212, duplicate data may be searched for as follows: the target data pointed to by the data storage address corresponding to the second index value is loaded into the deduplication processor and compared with the data in the second query unit, or the data in the second query unit is sent to the location of the data storage address corresponding to the second index value and compared there. Alternatively, the fingerprint values of the target data pointed to by the data storage address corresponding to the first index value may be compared with the fingerprint values of the data in the first query unit to find duplicate data.

Step 213: Store the data in the second query unit as new data, and insert into the index table the correspondence between the index value of the second query unit and the storage address of the data in the second query unit.
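The index construction described above (a fixed-length index assembled from bits of each block's fingerprint) can be sketched as follows. The text does not say which fingerprint function is used or which bits are taken, so SHA-1, the choice of the leading bits, the handling of a bit count that does not divide evenly, and the name `build_index` are assumptions made for illustration only.

```python
import hashlib


def build_index(blocks: list, index_bits: int = 40) -> int:
    """Compose one fixed-length index from the fingerprints of all blocks in a query unit."""
    if not blocks:
        raise ValueError("a query unit needs at least one minimum data block")
    share, extra = divmod(index_bits, len(blocks))
    index = 0
    for i, block in enumerate(blocks):
        # Assumption: leftover bits are spread over the first `extra` blocks.
        take = share + (1 if i < extra else 0)
        fp = int.from_bytes(hashlib.sha1(block).digest(), "big")  # 160-bit fingerprint
        leading = fp >> (160 - take)  # assumption: use the leading `take` bits
        index = (index << take) | leading
    return index
```

With two blocks each contributes 20 bits, matching the 40-bit example in the text; with a single block all 40 bits come from that block, so first and second query units share one index length and can live in the same index table.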
In this embodiment of the present invention, one index value is constructed for multiple minimum data blocks, and operations such as reference counting and compression are performed on the data in units of nominal-size data blocks, which greatly reduces memory usage. In addition, because data offsets would otherwise lower the deduplication rate, the duplicate-data search mixes nominal-size data blocks with data blocks smaller than the nominal size, which raises the deduplication rate.
Referring to FIG. 3, an embodiment of the present invention further provides another data processing method. For the parts of the flow that are the same as in the method embodiment of FIG. 2, refer to the embodiment of FIG. 2. The difference from the method embodiment of FIG. 2 is that, when no identical index value is found in the index table, the index value is altered and the search continues, to raise the probability of finding an identical index. The data processing method shown in FIG. 3 includes:
Step 301: The deduplication processor takes the data inside the sliding window, out of the data to be queried for duplicates, as one nominal-size data block. The nominal-size data block includes multiple minimum data blocks, a minimum data block being a data block of the smallest query unit used in the duplicate-data search. The data in the sliding window is taken as a first query unit.
Step 302: Perform index construction, which includes constructing, from the multiple minimum data blocks in the first query unit, one index value of a preset length corresponding to the first query unit.
Step 303: Query the preset index table for a first index value identical to the index value of the first query unit. If one is found, go to step 308; if not, go to step 304.

Step 304: Query the index table for a third index value whose degree of match with the index value of the first query unit is equal to or higher than a preset degree of match. If one is found, go to step 308; if not, go to step 306.
Note that obtaining the third index value and obtaining the first index value may be done in a single step in practice. They are written as two logical steps here only to make the description clearer.
Step 306: Determine whether one permutation cycle has been exceeded. If not, go to step 307; if so, go to step 309.
Step 307: Permute the positional order of the data within the index value of the first query unit, then return to step 303.

In this embodiment, when no identical index value is found in the index table for the index value of the first query unit, the positional order of the data within that index value is changed to raise the probability of finding an identical index value. The order can be changed in many ways; this embodiment uses permutation.

Specifically, the positional order may be changed as follows. Divide the index value of the first query unit into multiple parts and apply a first permutation to the positions of those parts within the index value. Determine whether, for the permuted index value, a third index value whose degree of match is equal to or higher than the preset degree of match can be found in the index table. If not, apply a second permutation to the positions of the parts within the index value, and again determine whether an identical third index value can be found in the index table for the permuted index value.
The permutation stops when an identical third index value is found in the index table for the permuted index value of the first query unit, or when all possible permutations have been tried.

Step 308: Search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.

Step 309: Determine whether the data before the start position of the sliding window, where "before" is relative to the reverse of the sliding direction of the window, has reached the size of the nominal-size data block. If not, go to step 310; if so, go to step 311.
Step 310: Slide the sliding window by a preset step, take the data in the sliding window after sliding as a first query unit, and return to step 302.
Step 311: Store the data before the start position of the sliding window as new data.

Step 312: Insert into the index table the correspondence between the index value of the new data and the storage address of the new data.
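Steps 303 to 307 (exact lookup, similarity lookup, then permuting the parts of the index value) can be sketched as below. This is only an illustration under stated assumptions: the degree of match is taken to be the fraction of identical bit positions, the index is split into equal-width parts, the table is scanned linearly, and the names `match_degree`, `split_parts`, `join_parts`, and `find_similar` are hypothetical.

```python
from itertools import permutations


def match_degree(a: int, b: int, bits: int) -> float:
    """Fraction of bit positions at which two index values agree (assumed metric)."""
    differing = bin((a ^ b) & ((1 << bits) - 1)).count("1")
    return (bits - differing) / bits


def split_parts(index: int, bits: int, parts: int) -> list:
    """Split an index value into `parts` equal-width pieces, most significant first."""
    width = bits // parts
    mask = (1 << width) - 1
    return [(index >> (width * (parts - 1 - i))) & mask for i in range(parts)]


def join_parts(parts, width: int) -> int:
    """Concatenate equal-width pieces back into one index value."""
    value = 0
    for part in parts:
        value = (value << width) | part
    return value


def find_similar(index, table, bits=40, parts=4, threshold=0.7):
    """Try every ordering of the parts; return the first stored index that matches well enough."""
    width = bits // parts
    # The first ordering produced is the original one, which covers steps 303 and 304.
    for order in permutations(split_parts(index, bits, parts)):
        candidate = join_parts(order, width)
        for stored in table:  # linear scan, kept simple for the sketch
            if match_degree(candidate, stored, bits) >= threshold:
                return stored
    return None  # one full permutation cycle exhausted: continue at step 309
```

Returning `None` corresponds to the branch in step 306 where the permutation cycle has been exceeded.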
In the embodiment of FIG. 3, a lookup in the index table does not require an exactly identical index. It only requires that the degree of match of the index value be equal to or higher than the preset degree of match, for example a preset under which 70% of the data in the index values is identical and only 30% differs. An index value considered highly similar to the index value of the first query unit is thus selected from the index table, and the target data at the data address corresponding to that highly similar index value in the index table is compared with the data in the first query unit to search for duplicate data. Accordingly, after step 308 the method may further include:
Step 313: Determine whether the data in the first query unit completely duplicates the data pointed to by the data address corresponding to the index value found. If it does not, go to step 314.
Whether the data is a complete duplicate can be determined by comparing fingerprint values.
Step 314: Perform delta compression on the data in the first query unit, based on the data pointed to by the data address corresponding to the index value found.
The specific delta compression algorithm may be, for example, zdelta, vcdiff, or xdelta; this embodiment of the present invention imposes no limitation.
Step 315: Store the data obtained by the delta compression, and insert into the index table the correspondence between the index value of the first query unit and the storage address of the delta-compressed data.
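The idea behind step 314 is to store only the differences between the first query unit and similar data that is already stored. The sketch below illustrates that idea with Python's standard `difflib.SequenceMatcher`. It is a toy encoder, not zdelta, vcdiff, or xdelta, and the operation format and the names `make_delta` and `apply_delta` are assumptions made for the example.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Encode `target` as copy operations against `base` plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))         # bytes already present in the stored data
        elif tag in ("replace", "insert"):
            ops.append(("data", target[j1:j2]))  # literal bytes that the base lacks
        # "delete": base bytes that do not appear in the target, nothing to emit
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the stored base and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the operation list needs to be stored in step 315, and the original data is recovered by applying it to the target data that the matched index value points to.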
In this embodiment, if the data in the first query unit completely duplicates the data pointed to by the data address corresponding to the index value found, the method may further include:
Step 316: Deduplicate the data in the first query unit.
Step 317: Take the data before the start position of the sliding window as a second query unit. Construct, from the at least one minimum data block in the second query unit, an index value of the same length as the index value of the first query unit, and query the index table for a second index value identical to the index value of the second query unit. If one is found, go to step 318; if not, go to step 319.

Step 318: Search the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

In step 318, duplicate data may be searched for as follows: the target data pointed to by the data storage address corresponding to the second index value is loaded into the deduplication processor and compared with the data in the second query unit, or the data in the second query unit is sent to the location of the data storage address corresponding to the second index value and compared there. Alternatively, the fingerprint values of the target data pointed to by the data storage address corresponding to the first index value may be compared with the fingerprint values of the data in the first query unit to find duplicate data.

After the duplicate-data query in step 318, if the data is a complete duplicate, the data of the second query unit is deduplicated. If it is not a complete duplicate, it may either be delta-compressed or be stored directly as new data. The present invention imposes no limitation; this embodiment takes storing incompletely duplicated data as new data as the example.
Step 319: Store the data in the second query unit as new data, and insert into the index table the correspondence between the index value of the second query unit and the storage address of the data in the second query unit.
The data processing method provided by this method embodiment builds the index over nominal-size data blocks that each contain multiple minimum query units, which greatly reduces memory usage. Furthermore, by sliding the sliding window in steps of the minimum query unit before checking again, this embodiment avoids the drop in deduplication rate caused by offsets in the data being searched, achieves mixed-granularity duplicate-data search, and raises the deduplication rate while reducing memory usage.
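The overall sliding-window flow can be sketched as follows, restricted to exact index matches (no similarity lookup or delta compression). This is a simplified illustration, not the claimed method: the SHA-1 based index, the in-memory `table` and `store`, the handling of the data left over at the end of the input, and the names `index_of` and `deduplicate` are all assumptions.

```python
import hashlib


def index_of(blocks, bits=40):
    """Fixed-length index built from the leading SHA-1 bits of every block (assumed scheme)."""
    share, extra = divmod(bits, len(blocks))
    value = 0
    for i, block in enumerate(blocks):
        take = share + (1 if i < extra else 0)
        fp = int.from_bytes(hashlib.sha1(block).digest(), "big")
        value = (value << take) | (fp >> (160 - take))
    return value


def deduplicate(blocks, window, table, store):
    """Return the storage addresses that reproduce `blocks`.

    table maps index value -> storage address; store is a list of stored units.
    """
    refs = []

    def lookup(unit):
        addr = table.get(index_of(unit))
        return addr if addr is not None and store[addr] == unit else None

    def emit(unit):
        # Reference the unit if it is already stored, otherwise store it as new data.
        addr = lookup(unit)
        if addr is None:
            store.append(unit)
            addr = len(store) - 1
            table.setdefault(index_of(unit), addr)
        refs.append(addr)

    start = pos = 0  # start: first block not yet handled; pos: window start
    while pos + window <= len(blocks):
        addr = lookup(blocks[pos:pos + window])
        if addr is not None:
            if start < pos:
                emit(blocks[start:pos])  # second query unit, smaller than the window
            refs.append(addr)            # the window content is a duplicate
            pos += window
            start = pos
        else:
            if pos - start >= window:    # a full nominal-size block lies before the window
                emit(blocks[start:pos])  # stored as new data (the lookup here is redundant)
                start = pos
            pos += 1                     # slide by one minimum data block
    for i in range(start, len(blocks), window):
        emit(blocks[i:i + window])       # assumption: leftover data handled in window-size pieces
    return refs
```

Feeding the same blocks a second time with one extra block in front shows the effect of sliding: the extra block is stored as a small unit, and the shifted nominal-size blocks are still recognised as duplicates.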
An embodiment of the present invention further provides a data processing apparatus for performing the method provided in the foregoing embodiments. Its implementation principle and technical effects are similar to those of the method provided by the embodiments of the present invention; refer to the description in the method embodiments. In a specific implementation, the data processing apparatus may be a deduplication processor or any apparatus that performs the same function, for example a storage node equipped with a deduplication processor. The apparatus is applied in a data processing system that includes the data processing apparatus and a storage node, the data processing apparatus communicating with the storage node.
Referring to FIG. 4, an embodiment of the present invention provides a schematic structural diagram of a data processing apparatus, including:

an index construction unit 401, configured to perform index construction, which includes: taking the data covered by the sliding window, out of the data to be queried for duplicates, as a first query unit; extracting some bits from the fingerprint value of each minimum data block in the first query unit; and composing the extracted bits into one index value of a preset length corresponding to the first query unit, where the first query unit includes multiple minimum data blocks and a minimum data block is a data block of the smallest query unit used in the duplicate-data search;

an index matching unit 402, configured to query a preset index table for an index value identical to the index value of the first query unit, to obtain a match result; and
a duplicate-data search unit 403, configured to, if the index matching unit finds in the index table a first index value identical to the index value of the first query unit, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
In this embodiment, the first query unit includes multiple minimum data blocks, and one index of a preset length is constructed by taking a partial index value from the index value of each minimum data block, which greatly reduces the number of indexes held in memory.

If the duplicate-data search unit 403 finds that the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value, then:

the index construction unit 401 is further configured to take the data before the start position of the sliding window as a second query unit, where "before" is relative to the reverse of the sliding direction of the window, and to construct one index value of the preset length from the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block;

the index matching unit 402 is further configured to query the index table for a second index value identical to the index value of the second query unit; and
the duplicate-data search unit 403 is further configured to, if the index matching unit 402 finds in the index table a second index value identical to the index value of the second query unit, search the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.

If the index matching unit 402 does not find in the index table a second index value identical to the index value of the second query unit, the data processing apparatus may further include:

a first storage unit 404, configured to store the data in the second query unit; and
a first index table update unit 405, configured to insert into the preset index table the correspondence between the index value of the data of the second query unit and the storage address of the data in the second query unit.
In this embodiment, the index table holds both index values of first query units and index values of second query units. Because the data blocks covered by the first query unit and the second query unit differ in size, the index table contains mixed index values corresponding to several data block sizes, which reduces memory usage while raising the duplicate hit rate.

If the duplicate-data search unit 403 finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the apparatus may further include:
a first determining unit 406, configured to, when the duplicate-data search unit 403 finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, determine whether the data before the start position of the sliding window, where "before" is relative to the reverse of the sliding direction of the window, has reached the size of the first query unit;
a first instruction unit 407, configured to slide the sliding window by a preset step when the first determining unit 406 determines that the data before the start position of the sliding window has not reached the size of the first query unit; and
the index construction unit 401 is further configured to take the data in the sliding window after sliding as a first query unit and construct the index.
If the index matching unit 402 does not find in the index table a first index value identical to the index value of the first query unit, the apparatus may further include:

a second determining unit 408, configured to, when the duplicate-data search unit 403 finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, determine whether the data before the start position of the sliding window, where "before" is relative to the reverse of the sliding direction of the window, has reached the size of the first query unit;

a second instruction unit 409, configured to slide the sliding window by a preset step when the second determining unit 408 determines that the size of the data before the start position of the sliding window has not reached the size of the first query unit; and
the index construction unit 401 is further configured to take the data in the sliding window after sliding as a first query unit and construct the index.
If the second determining unit 408 determines that the size of the data before the start position of the sliding window has reached the size of the first query unit, the sliding window has slid across an amount of data equal to the size of one first query unit. The data slid across is the same size as the data the sliding window once covered, and no identical index was found for it in the index table. The data before the window is therefore stored directly as new data. Accordingly, the data processing apparatus may further include:

a second storage unit 410, configured to store the data before the start position of the sliding window as new data if the second determining unit 408 determines that the size of the data before the start position of the sliding window has reached the size of the first query unit; [...] the correspondence of the storage address is inserted into the index table.
本发明实施例中,通过对多个最小数据分块构造一个索引值,在对数据进 行引用计数,压缩等以额定大小数据块为单位 艮大程度上减少了内存的占用, 并且重复数据查找过程中, 考虑数据偏移造成的重删率降低,将额定大小数据 块和小于额定大小的数据块混合的形式进行重复数据的查找, 提高重删率。 参见图 5, 本发明实施例还提供了一种数据处理装置,在图 4所提供的装置 的结构图的基础上提供了优化的方案, 本发明实施例和图 4对应的实施例中的 装置不同的是,若在所述索引表中没有查询到与所述第一查询单位对应索引值 相同的第一索引值,在判断所述滑动窗口起始位置之前的数据的大小是否达到 所述第一查询单位的大小之前,对所述第一查询单位对应的索引值中数据的位 置关系进行排列组合之后, 再将排列组合后的索引值在所述索引表中进行匹 配, 以提高匹配率。 本发明实施例所提供的数据处理装置, 包括: 索引构造单元 501 , 用于索引构造, 所述索引构造包括: 将需要重复数据 查询的数据中在滑动窗口所覆盖的数据作为一个第一查询单位,从所述第一查 询单位中每个个最小数据块的指纹值中分别抽取部分比特位,将抽取的比特位 组成所述第一查询单位对应的的一个预设长度的索引值, 其中, 所述第一查询 单位中包括多个最小数据块,所述最小数据块为进行重复数据查找的最小查询 单位的数据块; 索引匹配单元 502, 用于在预先设置的索引表中查询是否有与所述第一查 询单位对应索引值相同的索引值, 获得匹配结果; 重复数据查找单元 503 , 用于若所述索引匹配单元在所述索引表中查询到 与所述第一查询单位对应索引值相同的第一索引值,则查找所述第一查询单位 中的数据是否有与所述第一索引值对应的数据存储地址指向的目标数据重复 的数据。 其中, 所述若重复数据查找单元 503得到所述第一查询单位中的数据和所 述第一索引值对应的数据存储地址指向的目标数据完全重复, 则: 所述索引构造单元 501 , 还用于将所述滑动窗口起始位置之前的数据, 作 为一个第二查询单位, 所述之前是针对所述滑动窗口滑动的反方向而言,根据 所述第二查询单位中的所述至少一个最小数据块构造一个所述预设长度的索 引值, 其中, 所述第二查询单位包括至少一个最小数据块; 所述索引匹配单元 502, 还用于在所述索引表中查询是否有与所述第二查 询单位对应索引值相同的第二索引值; 所述重复数据查找 503单元还用于,若所述索引匹配单元 502在所述索引表 中查询到与所述第二查询单位对应索引值相同的第二索引值,则查找所述第二 查询单位中的数据中是否有与所述第二索引值对应的数据存储地址指向的目 标数据重复的数据。 其中, 若所述索引匹配单元 501在所述索引表中没有查询到与所述第二查 询单位对应索引值相同的第二索引值, 则所述数据处理装置还可以包括: 第一存储单元 504, 用于存储所述第二查询单位中的数据; 第一索引表更新单元 505 , 用于将第二查询单位的数据对应的索引值与所 述第二查询单位中的数据的存储地址之间的对应关系插入到所述预先设置的 索引表中。 其中, 在若所述索引匹配单元 502在所述索引表中没有查询到与所述第一 查询单位对应索引值相同的第一索引值,还可以用于: 在所述索引表中查询与 所述第一查询单位对应索引值匹配度等于或高于预设的匹配度的第三索引值; 所述索引表中没有与所述第一查询单位对应索引值匹配度等于或高于预设的 匹配度的第三索引值,则将所述第一查询单位对应的索引值中的数据的位置顺 序进行排列组合,判断进行排列组合后的所述第一查询单位对应的索引值在所 述索引表中是否查找到相同的第三索引值; 相应的, 所述数据处理装置还可以包括第二判断单元 508和第二指令单元In the embodiment of the present invention, by constructing an index value for a plurality of minimum data partitions, the data is reference counted, compressed, etc., in units of rated size data blocks, the memory occupation is greatly reduced, and the data search process is repeated. 
In the case of reducing the deduplication rate caused by the data offset, the repeated data is searched in a form in which the data block of the rated size and the data block smaller than the rated size are mixed, and the deduplication rate is improved. Referring to FIG. 5, an embodiment of the present invention further provides a data processing apparatus, which provides an optimized solution based on the structural diagram of the apparatus provided in FIG. 4, and the apparatus in the embodiment corresponding to the present invention and FIG. The difference is that if the first index value that is the same as the index value corresponding to the first query unit is not found in the index table, whether the size of the data before the start position of the sliding window reaches the first Before the size of the query unit, the positional relationship of the data in the index value corresponding to the first query unit is arranged and combined, and then the index values of the array combination are matched in the index table to improve the matching rate. 
The data processing apparatus provided by this embodiment of the present invention includes:
an index construction unit 501, configured to perform index construction, where the index construction includes: taking, from the data that requires duplicate data search, the data covered by the sliding window as a first query unit; extracting some bits from the fingerprint value of each minimum data block in the first query unit; and combining the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is a data block of the minimum query unit used for duplicate data search;
an index matching unit 502, configured to query a preset index table for an index value identical to the index value corresponding to the first query unit, to obtain a matching result; and
a duplicate data search unit 503, configured to: if the index matching unit finds in the index table a first index value identical to the index value corresponding to the first query unit, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
If the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value, the index construction unit 501 is further configured to take the data preceding the start position of the sliding window as a second query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides, and to construct an index value of the preset length from the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block; the index matching unit 502 is further configured to query the index table for a second index value identical to the index value corresponding to the second query unit; and the duplicate data search unit 503 is further configured to: if the index matching unit 502 finds in the index table a second index value identical to the index value corresponding to the second query unit, search the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
If the index matching unit 502 does not find in the index table a second index value identical to the index value corresponding to the second query unit, the data processing apparatus may further include: a first storage unit 504, configured to store the data in the second query unit; and a first index table update unit 505, configured to insert into the preset index table the correspondence between the index value corresponding to the data of the second query unit and the storage address of the data in the second query unit.
If the index matching unit 502 does not find a first index value identical to the index value corresponding to the first query unit, the index matching unit 502 may be further configured to: query the index table for a third index value whose degree of match with the index value corresponding to the first query unit is equal to or higher than a preset degree of match; and, if the index table contains no such third index value, permute the positional order of the data within the index value corresponding to the first query unit and determine whether an identical third index value can be found in the index table for the permuted index value. Correspondingly, the data processing apparatus may further include a second determining unit 508 and a second instruction unit 509.
The second determining unit 508 is configured to: when the index matching unit ultimately fails to match the third index value, determine whether the size of the data preceding the start position of the sliding window reaches the size of the first query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides, and send the result to the second instruction unit 509.
The second instruction unit 509 is configured to slide the sliding window by a preset step when the second determining unit 508 determines that the size of the data preceding the start position of the sliding window has not reached the size of the first query unit.
The index construction unit 501 is further configured to take the data in the sliding window after sliding as a first query unit and construct an index.
The data processing apparatus may further include: a second storage unit 510, configured to store the data preceding the start position of the sliding window as new data if the second determining unit 508 determines that the size of that data reaches the size of the first query unit; and a second index table update unit 511, configured to insert into the index table the correspondence between the index value corresponding to the new data and the storage address of the new data.
The duplicate data search unit 503 is further configured to: if the index matching unit 502 finds in the index table a third index value whose degree of match with the first index value corresponding to the first query unit is equal to or higher than the preset degree of match, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.
If the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the data processing apparatus may further include:
a delta compression unit 506, configured to obtain the search result of the duplicate data search unit 503 and, if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, perform delta compression on the data in the first query unit relative to that target data;
the second storage unit 510 is further configured to store the delta-compressed data to a storage node;
the second index table update unit 511 is further configured to insert into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.
The data processing apparatus provided by this embodiment builds its index over rated-size data blocks that each contain a plurality of minimum query units, which greatly reduces memory usage. Further, by sliding the window in steps of the minimum query unit before making the determination, the embodiment avoids the drop in deduplication rate caused by offsets during data search and realizes mixed-granularity duplicate data search, improving the deduplication rate while reducing memory usage. In addition, during index matching, permuting the data within the index value corresponding to the first query unit improves the duplicate data hit rate.
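The index construction summarized above, in which one index value covers a whole rated-size unit, might look like the following minimal sketch. The block size, the number of blocks per unit, the number of bits taken per fingerprint, the use of SHA-1 as the fingerprint function, and the choice of the leading bits are all assumptions made for illustration; the function names are hypothetical.

```python
import hashlib

MIN_BLOCK = 4096        # assumed size of a minimum data block, in bytes
BLOCKS_PER_UNIT = 8     # assumed number of minimum blocks per first query unit
BITS_PER_BLOCK = 8      # assumed number of bits taken from each fingerprint


def fingerprint(block: bytes) -> bytes:
    """Fingerprint of one minimum data block (SHA-1 is an assumption)."""
    return hashlib.sha1(block).digest()


def build_index_value(unit: bytes) -> int:
    """Concatenate the leading BITS_PER_BLOCK bits of every block fingerprint."""
    index_value = 0
    for offset in range(0, len(unit), MIN_BLOCK):
        fp = int.from_bytes(fingerprint(unit[offset:offset + MIN_BLOCK]), "big")
        top_bits = fp >> (160 - BITS_PER_BLOCK)   # a SHA-1 digest has 160 bits
        index_value = (index_value << BITS_PER_BLOCK) | top_bits
    return index_value


def find_candidate(index_table: dict, unit: bytes):
    """Return the storage address recorded for the unit's index value, or None."""
    return index_table.get(build_index_value(unit))
```

With these assumed parameters a 32 KiB unit is represented by a single 64-bit index value instead of eight full fingerprints, which illustrates where the memory saving comes from. A hit only nominates a candidate; the data itself would still have to be compared against the target data.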
Referring to FIG. 6, an embodiment of the present invention further provides a deduplication processor 600, including a processor 61, a memory 62, a communication interface 63, and a communication bus 64.
The processor 61, the communication interface 63, and the memory 62 communicate with one another through the communication bus 64. The communication interface 63 is configured to receive and send data.
The memory 62 is configured to store a program. The memory 62 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory.
The processor 61 is configured to execute the program in the memory to perform the data processing method provided by the foregoing method embodiments.
If the functions are implemented in the form of software functional units and sold or used as a standalone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data processing method, wherein the method is applied to a data processing system, the data processing system includes a deduplication processor, and the method includes: taking, by the deduplication processor, the data covered by a sliding window that requires duplicate data search as a first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is a data block of the minimum query unit used for duplicate data search; and performing index construction and duplicate data search on the data in the first query unit; where the index construction includes: extracting some bits from the fingerprint value of each minimum data block in the first query unit, and combining the extracted bits into an index value of a preset length corresponding to the first query unit; and the duplicate data search includes: querying a preset index table for an index value identical to the index value corresponding to the first query unit, and, if a first index value identical to the index value corresponding to the first query unit is found in the index table, searching the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
2. The method according to claim 1, further including: if the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value: taking the data preceding the start position of the sliding window as a second query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides, and the second query unit includes at least one minimum data block; constructing an index value of the preset length from the at least one minimum data block in the second query unit; querying the index table for a second index value identical to the index value corresponding to the second query unit; and, if a second index value identical to the index value corresponding to the second query unit is found in the index table, searching the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
3. The method according to claim 2, further including: if no second index value identical to the index value corresponding to the second query unit is found in the index table, inserting into the index table the correspondence between the index value corresponding to the second query unit and the storage address of the data in the second query unit.
4. The method according to claim 1, wherein, if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the method further includes: determining whether the size of the data preceding the start position of the sliding window reaches the size of the first query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides; and if not, sliding the sliding window by a preset step, taking the data in the sliding window after sliding as a first query unit, and performing the index construction step and the duplicate data search step.
5. The method according to claim 1, 2, or 3, wherein, if no first index value identical to the index value corresponding to the first query unit is found in the index table, the method further includes: determining whether the size of the data preceding the start position of the sliding window reaches the size of the first query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides; and if not, sliding the sliding window by a preset step, taking the data in the sliding window after sliding as the first query unit, and performing the index construction step and the duplicate data search step.
6. The method according to claim 5, wherein, if it is determined that the size of the data preceding the start position of the sliding window reaches the size of the first query unit, the data preceding the start position of the sliding window is stored as new data, and the correspondence between the index value corresponding to the new data and the storage address of the new data is inserted into the index table.
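For illustration only, the slide-or-store decision of claims 5 and 6, together with the handling of the data preceding the window from claims 2 and 3, can be sketched as a single loop. The sketch simplifies several points and these simplifications are assumptions: an index hit is treated as a duplicate without comparing the data itself, a storage address is modelled as a position in a Python list, trailing bytes shorter than one unit are left unprocessed, and `index_of` stands in for whatever index construction is used. The names `dedup_scan` and `store_new` are hypothetical.

```python
def dedup_scan(data, index_table, storage, unit_size, step, index_of):
    """Slide a window over `data`; return (offset, address) pairs for index hits."""

    def store_new(chunk):
        # Store the chunk as new data and record index value -> storage address.
        storage.append(chunk)
        index_table[index_of(chunk)] = len(storage) - 1

    hits = []
    pending = 0   # start of the unresolved data preceding the window
    win = 0       # start position of the sliding window
    while win + unit_size <= len(data):
        key = index_of(data[win:win + unit_size])
        if key in index_table:
            hits.append((win, index_table[key]))
            if pending < win:
                before = data[pending:win]            # the second query unit
                before_key = index_of(before)
                if before_key in index_table:
                    hits.append((pending, index_table[before_key]))
                else:
                    store_new(before)
            win += unit_size
            pending = win
        elif win - pending < unit_size:
            win += step                               # too little data before the window: slide
        else:
            store_new(data[pending:pending + unit_size])
            pending += unit_size
    return hits
```

For example, with `unit_size=4`, `step=1`, `index_of=bytes`, and a table that already contains `b"CCCC"`, scanning `b"AAAABBBBCCCC"` stores `b"AAAA"` once four unmatched bytes have accumulated, reports a hit at offset 8, and then stores `b"BBBB"` as the data preceding the matched window.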
7. The method according to claim 5, wherein, if no first index value identical to the index value corresponding to the first query unit is found in the index table, then before determining whether the size of the data preceding the start position of the sliding window reaches the size of the first query unit, the method further includes: querying the index table for a third index value whose degree of match with the index value corresponding to the first query unit is equal to or higher than a preset degree of match; if the index table contains no such third index value, permuting the positional order of the values within the index value corresponding to the first query unit, and determining whether the third index value can be found in the index table for the permuted index value corresponding to the first query unit; and if it is not found, proceeding to the step of determining whether the size of the data preceding the start position of the sliding window reaches the size of the first query unit.
8. The method according to claim 7, wherein permuting the positional order of the data in the first query unit and determining whether the third index value can be found in the index table for the permuted index value corresponding to the first query unit includes: dividing the index value corresponding to the first query unit into a plurality of parts, and performing a first permutation of the positional order of the plurality of parts within the index value corresponding to the first query unit; determining whether a third index value whose degree of match is equal to or higher than the preset degree of match can be found in the index table for the permuted index value corresponding to the first query unit; if not, performing a second permutation of the positional order of the plurality of parts within the index value corresponding to the first query unit, and continuing to determine whether an identical third index value can be found in the index table for the permuted index value corresponding to the first query unit; and when, after all permutations have been completed, no third index value whose degree of match with the first index value is equal to or higher than the preset degree of match has been found in the index table, stopping the permutation and proceeding to the step of determining whether the size of the data preceding the start position of the sliding window reaches the size of the first query unit.
9. The method according to claim 7, wherein, if a third index value whose degree of match with the first index value corresponding to the first query unit is equal to or higher than the preset degree of match is found in the index table, the data in the first query unit is searched for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.
10. The method according to claim 9, further including: obtaining the duplicate data search result; and, if the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, performing delta compression on the data in the first query unit relative to the target data pointed to by the data storage address corresponding to the first index value, storing the delta-compressed data, and inserting into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.
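The claim does not specify a delta algorithm. As one possible illustration, the sketch below encodes the data of the first query unit as a list of copy and add operations relative to the target data, using the standard-library `difflib.SequenceMatcher` to find the common runs. The operation format and the names `delta_encode` and `delta_decode` are hypothetical and are not taken from the patent.

```python
from difflib import SequenceMatcher


def delta_encode(target: bytes, data: bytes) -> list:
    """Describe `data` as copies from `target` plus literal additions."""
    ops = []
    matcher = SequenceMatcher(None, target, data, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse target[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("add", data[j1:j2]))      # bytes not present in the target
        # "delete": those target bytes are simply never copied
    return ops


def delta_decode(target: bytes, ops: list) -> bytes:
    """Rebuild the original data from the target data and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += target[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Only the operation list needs to be stored for a unit that mostly duplicates the target data; decoding requires the target data to remain readable at its recorded storage address.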
11. The method according to any one of claims 1 to 3, wherein extracting some bits from the fingerprint value of each minimum data block in the first query unit and combining the extracted bits into an index value of a preset length corresponding to the first query unit includes: obtaining the fingerprint value of each minimum data block in the first query unit, extracting a preset, identical number of bits from the fingerprint value corresponding to each minimum data block, and combining all the extracted bits into the index value of the preset length corresponding to the first query unit.
12. The method according to any one of claims 1 to 3, wherein the index table is stored in the deduplication processor, and querying a preset index table for an index value identical to the index value corresponding to the first query unit includes: querying the preset index table in the deduplication processor for an index value identical to the index value corresponding to the first query unit; or the data processing system further includes a storage node, the preset index table is stored in the storage node, and querying a preset index table for an index value identical to the index value corresponding to the first query unit includes: sending, by the deduplication processor, the index value corresponding to the first query unit to the storage node, and receiving the query result from the storage node, thereby obtaining information on whether the preset index table contains an index value identical to the index value corresponding to the first query unit.
13. A data processing apparatus, including: an index construction unit, configured to perform index construction, where the index construction includes: taking, from the data that requires duplicate data search, the data covered by a sliding window as a first query unit, extracting some bits from the fingerprint value of each minimum data block in the first query unit, and combining the extracted bits into an index value of a preset length corresponding to the first query unit, where the first query unit includes a plurality of minimum data blocks, and a minimum data block is a data block of the minimum query unit used for duplicate data search; an index matching unit, configured to query a preset index table for an index value identical to the index value corresponding to the first query unit; and a duplicate data search unit, configured to: if the index matching unit finds in the index table a first index value identical to the index value corresponding to the first query unit, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the first index value.
14. The data processing apparatus according to claim 13, wherein, if the duplicate data search unit finds that the data in the first query unit completely duplicates the target data pointed to by the data storage address corresponding to the first index value, then: the index construction unit is further configured to take the data preceding the start position of the sliding window as a second query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides, and to construct an index value of the preset length from the at least one minimum data block in the second query unit, where the second query unit includes at least one minimum data block; the index matching unit is further configured to query the index table for a second index value identical to the index value corresponding to the second query unit; and the duplicate data search unit is further configured to: if the index matching unit finds in the index table a second index value identical to the index value corresponding to the second query unit, search the data in the second query unit for data that duplicates the target data pointed to by the data storage address corresponding to the second index value.
15. The data processing apparatus according to claim 14, wherein, if the index matching unit does not find in the index table a second index value identical to the index value corresponding to the second query unit, the apparatus further includes: a first storage unit, configured to store the data in the second query unit; and a first index table update unit, configured to insert into the preset index table the correspondence between the index value corresponding to the data of the second query unit and the storage address of the data in the second query unit.
16. The data processing apparatus according to claim 13, wherein, if the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, the apparatus further includes: a first determining unit, configured to: when the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, determine whether the data preceding the start position of the sliding window reaches the size of the first query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides; and a first instruction unit, configured to slide the sliding window by a preset step when the first determining unit determines that the data preceding the start position of the sliding window has not reached the size of the first query unit; and the index construction unit is further configured to take the data in the sliding window after sliding as a first query unit and construct an index.
17. The data processing apparatus according to claim 13, 14, or 15, wherein, if the index matching unit does not find in the index table a first index value identical to the index value corresponding to the first query unit, the apparatus further includes: a second determining unit, configured to: when the duplicate data search unit finds that the data in the first query unit does not completely duplicate the target data pointed to by the data storage address corresponding to the first index value, determine whether the data preceding the start position of the sliding window reaches the size of the first query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides; and a second instruction unit, configured to slide the sliding window by a preset step when the second determining unit determines that the size of the data preceding the start position of the sliding window has not reached the size of the first query unit; and the index construction unit is further configured to take the data in the sliding window after sliding as a first query unit and construct an index.
18. The data processing apparatus according to claim 17, further including: a second storage unit, configured to store the data preceding the start position of the sliding window as new data if the second determining unit determines that the size of the data preceding the start position of the sliding window reaches the size of the first query unit; and a second index table update unit, configured to insert into the index table the correspondence between the index value corresponding to the new data and the storage address of the new data.
19. The data processing apparatus according to claim 17, wherein the index matching unit is further configured to: if no first index value identical to the index value corresponding to the first query unit is found in the index table, query the index table for a third index value whose degree of match with the index value corresponding to the first query unit is equal to or higher than a preset degree of match; and, if the index table contains no such third index value, permute the positional order of the values within the index value corresponding to the first query unit and determine whether an identical third index value can be found in the index table for the permuted index value corresponding to the first query unit; and the second determining unit is further configured to: when the index matching unit ultimately fails to match the third index value, determine whether the size of the data preceding the start position of the sliding window reaches the size of the first query unit, where "preceding" refers to the direction opposite to that in which the sliding window slides, and send the result to the second instruction unit.
20. The data processing apparatus according to claim 19, wherein, if no first index value identical to the index value corresponding to the first query unit is found in the index table, the index matching unit is specifically configured to: divide the index value corresponding to the first query unit into a plurality of parts, and perform a first permutation of the positional order of the plurality of parts within the index value corresponding to the first query unit; determine whether a third index value whose degree of match is equal to or higher than the preset degree of match can be found in the index table for the permuted index value corresponding to the first query unit; if not, perform a second permutation of the positional order of the plurality of parts within the index value corresponding to the first query unit, and continue to determine whether an identical third index value can be found in the index table for the permuted index value corresponding to the first query unit; and when, after all permutations have been completed, no third index value whose degree of match with the first index value is equal to or higher than the preset degree of match has been found in the index table, stop the permutation and send the determination result to the second determining unit.
21. The data processing device according to claim 19, wherein the duplicate data search unit is further configured to: if the index matching unit finds, in the index table, a third index value whose matching degree with the first index value corresponding to the first query unit is equal to or higher than the preset matching degree, search the data in the first query unit for data that duplicates the target data pointed to by the data storage address corresponding to the third index value.
22. The device according to claim 21, further comprising:
a delta compression unit, configured to obtain the duplicate data search result of the duplicate data search unit and, if the data in the first query unit and the target data pointed to by the data storage address corresponding to the first index value are not completely identical, perform delta compression on the data in the first query unit relative to the target data pointed to by the data storage address corresponding to the first index value;
wherein the second storage unit is further configured to store the delta-compressed data; and
the second index table update unit is further configured to insert into the index table the correspondence between the index value corresponding to the first query unit and the storage address of the delta-compressed data.
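As a rough illustration of the delta compression step in claim 22, the following Python sketch encodes the data of a query unit as a list of copy and insert operations relative to the target data. The claim does not name a delta algorithm. Using the standard-library `difflib.SequenceMatcher` is an assumption made only for illustration, and the operation format and function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def delta_encode(target: bytes, data: bytes) -> List[Tuple]:
    """Encode data relative to target.

    Emits ("copy", offset, length) for runs shared with target and
    ("insert", literal_bytes) for bytes that only appear in data.
    """
    ops: List[Tuple] = []
    matcher = SequenceMatcher(None, target, data, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            ops.append(("insert", data[j1:j2]))
        # "delete" needs no operation: those target bytes are simply skipped.
    return ops


def delta_decode(target: bytes, ops: List[Tuple]) -> bytes:
    """Rebuild the original data from the target data and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += target[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

In a device along the lines of the claim, the encoded operations would be stored in place of the raw data, and the index table would map the query unit's index value to the address of that stored delta.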
23. The data processing device according to any one of claims 13 to 15, wherein the index construction unit is specifically configured to: obtain the fingerprint value of each minimum data block in the first query unit; extract a preset, equal number of bits from the fingerprint value corresponding to each minimum data block; and combine all the extracted bits into an index value of a preset length corresponding to the first query unit.
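The index construction in claim 23 can be sketched in Python as follows. The claim does not specify the fingerprint function or which bits are extracted. The sketch assumes SHA-1 fingerprints and takes the leading bits of each one, and the function name and the `block_size` and `bits_per_block` parameters are hypothetical.

```python
import hashlib


def build_index(query_unit: bytes,
                block_size: int = 4096,
                bits_per_block: int = 8) -> int:
    """Build one index value for a query unit.

    The query unit is split into minimum data blocks of block_size bytes.
    The leading bits_per_block bits of each block's SHA-1 fingerprint are
    concatenated in block order, so the index is
    (number of blocks * bits_per_block) bits long.
    """
    if not query_unit or len(query_unit) % block_size != 0:
        raise ValueError("query unit must be a whole number of blocks")
    if not 0 < bits_per_block <= 160:
        raise ValueError("bits_per_block must be between 1 and 160")

    index = 0
    for offset in range(0, len(query_unit), block_size):
        block = query_unit[offset:offset + block_size]
        fingerprint = int.from_bytes(hashlib.sha1(block).digest(), "big")
        # Keep only the most significant bits of the 160-bit fingerprint.
        extracted = fingerprint >> (160 - bits_per_block)
        index = (index << bits_per_block) | extracted
    return index
```

For example, a query unit of four 4096-byte blocks with 8 bits taken per block yields a 32-bit index value, so two query units whose blocks have the same fingerprints in the same order receive the same index.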
24. A deduplication processor, comprising a processor, a memory, a communication interface, and a bus;
wherein the processor, the communication interface, and the memory communicate with one another through the bus; the communication interface is configured to receive and send data;
the memory is configured to store a program; and the processor is configured to execute the program in the memory to perform the method according to any one of claims 1 to 12.
PCT/CN2013/086253 2013-10-30 2013-10-30 Data processing method, device, and duplication processor WO2015061995A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201380002568.0A CN103930890B (en) 2013-10-30 2013-10-30 Data processing method, device, and deduplication processor
PCT/CN2013/086253 WO2015061995A1 (en) 2013-10-30 2013-10-30 Data processing method, device, and duplication processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2013/086253 WO2015061995A1 (en) 2013-10-30 2013-10-30 Data processing method, device, and duplication processor

Publications (1)

Publication Number Publication Date
WO2015061995A1 (en) 2015-05-07

Family

ID=51147967

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/086253 WO2015061995A1 (en) 2013-10-30 2013-10-30 Data processing method, device, and duplication processor

Country Status (2)

Country Link
CN (1) CN103930890B (en)
WO (1) WO2015061995A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3223167B1 (en) 2014-12-16 2018-11-21 Huawei Technologies Co., Ltd. Storage space management method and device
CA2977742C (en) * 2016-09-28 2024-04-16 Huawei Technologies Co., Ltd. Method for deduplication in storage system, storage system, and controller
CN109284424B (en) * 2018-09-21 2021-10-19 长沙学院 Method for constructing sliding condition table
CN109358987B (en) * 2018-10-26 2019-09-24 黄淮学院 A kind of backup cluster based on two-stage data deduplication
CN109582640B (en) * 2018-11-15 2020-12-01 深圳市酷开网络科技有限公司 Sliding window-based data deduplication storage method and device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156727A (en) * 2011-04-01 2011-08-17 华中科技大学 Method for deleting repeated data by using double-fingerprint hash check
CN102629258A (en) * 2012-02-29 2012-08-08 浪潮(北京)电子信息产业有限公司 Repeating data deleting method and device
CN103150260A (en) * 2011-11-25 2013-06-12 华为数字技术(成都)有限公司 Method and device for deleting repeating data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8930307B2 (en) * 2011-09-30 2015-01-06 Pure Storage, Inc. Method for removing duplicate data from a storage array

Also Published As

Publication number Publication date
CN103930890B (en) 2015-09-23
CN103930890A (en) 2014-07-16

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 13896276; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 13896276; Country of ref document: EP; Kind code of ref document: A1)