CN107944038B - Method and device for generating deduplication data - Google Patents

Method and device for generating deduplication data

Info

Publication number
CN107944038B
Authority
CN
China
Prior art keywords
data
key value
row
search key
larger
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711336936.1A
Other languages
Chinese (zh)
Other versions
CN107944038A (en)
Inventor
张钦
张黎敏
朱仲颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dameng Database Co Ltd
Original Assignee
Shanghai Dameng Database Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dameng Database Co Ltd filed Critical Shanghai Dameng Database Co Ltd
Priority to CN201711336936.1A priority Critical patent/CN107944038B/en
Publication of CN107944038A publication Critical patent/CN107944038A/en
Application granted granted Critical
Publication of CN107944038B publication Critical patent/CN107944038B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for generating deduplication data. The method comprises: in a sorted B-tree, locating the first row of data that satisfies an initial condition according to the structure of the B-tree, and fetching that row; generating a search key value from the fetched row, locating the first row of data larger than the search key value according to the structure of the B-tree, and fetching that row; returning to the step of generating a search key value from the most recently fetched row, until no data larger than the search key value exists in the B-tree; and generating the deduplication data from the fetched rows. Non-repeating data can thus be located using the structural features of the sorted B-tree without traversing all rows of data in the B-tree. This reduces the amount of data processed and therefore the deduplication processing time.

Description

Method and device for generating deduplication data
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for generating deduplication data.
Background
Data deduplication is a data compression technique used to eliminate redundant data. In a typical deduplication process, the first data is compared to the stored data to detect duplicates, i.e., to identify or determine whether the first data is unique. Then, when the first data is identified as non-unique, the redundant first data is eliminated and replaced with a small reference pointing to the stored data.
Currently, the following algorithm can be used for data deduplication. Step one: acquire the first row of data and place it in a temporary cache region. Step two: when a new row of data arrives, compare it one by one with the rows in the cache region; if an identical row is found, discard the new row; otherwise, put the new row into the cache region. Step three: attempt to acquire a new row; if one is obtained, repeat step two; otherwise, the data in the cache region is the deduplicated data.
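The cache-based procedure described above can be sketched as follows. This is a minimal illustration, not code from the patent; the function name `dedup_with_cache` and the use of a Python list as the temporary cache region are assumptions made for the example.

```python
def dedup_with_cache(rows):
    """Deduplicate rows by comparing each new row with a cache of kept rows."""
    cache = []  # temporary cache region holding the distinct rows seen so far
    for row in rows:
        # Step two: compare the new row with the cached rows one by one.
        if any(row == kept for kept in cache):
            continue  # an identical row is already cached: discard the new row
        cache.append(row)  # no match: keep the new row in the cache
    # Step three has run out of rows: the cache is the deduplicated data.
    return cache
```

Every input row is visited, and each visit may scan the whole cache, which is the cost the background section criticizes.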
The above algorithm has the following problems: if the amount of non-duplicate data is large, a large temporary space is required to store it. Furthermore, all of the data to be processed must be traversed, so the method is time-consuming when the data to be processed is large.
Another algorithm sorts the data in advance and then deduplicates the sorted data. Step one: output the first row directly and keep it as the retained row. Step two: acquire a new row; if the new row is the same as the retained row, discard it; if it is different, output the new row and set it as the retained row. Step three: attempt to acquire a new row; if one is obtained, repeat step two; otherwise, end the deduplication operation. The advantage of this algorithm is that it does not require a large temporary space, but it still requires traversing all of the data to be processed, so it is also very time-consuming.
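The sorted-input procedure can be sketched in the same way. Again this is only an illustration under assumptions: the name `dedup_sorted` is hypothetical, and the input is assumed to be already sorted.

```python
def dedup_sorted(sorted_rows):
    """Deduplicate already-sorted rows by comparing each row with the retained row."""
    output = []
    have_retained = False
    retained = None
    for row in sorted_rows:
        # Output the first row directly; afterwards output only rows that
        # differ from the retained row, and make them the new retained row.
        if not have_retained or row != retained:
            output.append(row)
            retained = row
            have_retained = True
    return output
```

Only one retained row is kept in memory, but the loop still touches every input row.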
Disclosure of Invention
The embodiments of the invention provide a method and a device for generating deduplication data, which solve the technical problem in the prior art that deduplicating data requires traversing all of the data to be processed.
In a first aspect, an embodiment of the present invention provides a method for generating deduplication data, including:
in a sorted B-tree, locating the first row of data that satisfies an initial condition according to the structure of the B-tree, and fetching that row;
generating a search key value from the fetched row, locating the first row of data larger than the search key value according to the structure of the B-tree, and fetching that row;
returning to the step of generating a search key value from the most recently fetched row, until no data larger than the search key value exists in the B-tree;
and generating the deduplication data from the fetched rows.
In a second aspect, an embodiment of the present invention further provides an apparatus for generating deduplication data, including:
the starting positioning module is used for positioning a first row of data meeting the starting condition in the screening condition according to the structure of the B tree in the sorted B tree and taking out the first row of data;
the middle positioning module is used for generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the search key value according to the structure of the B tree, and taking out the first row of data;
the circular positioning module is used for returning and executing operations of generating a new search key value according to the key value of the extracted first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and extracting the first row of data until the new search key value is larger than the key value corresponding to the termination condition in the screening condition;
and the generating module is used for generating the deduplication data according to all the taken first line data.
According to the method and the device for generating deduplication data provided by the embodiments of the invention, the first row of data satisfying the initial condition is located by using the structural features of the sorted B-tree. The search key value is then replaced iteratively according to each first row of data obtained by locating, until the newly generated search key value is larger than the key value corresponding to the termination condition in the screening condition. The deduplication data is generated from the fetched first rows of data. Non-repeating data can thus be located using the structural features of the sorted B-tree without traversing all rows of data in the B-tree structure. This reduces the amount of data processed and therefore the deduplication processing time.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is a flowchart of a method for generating deduplication data according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a B-tree in the method for generating deduplication data according to an embodiment of the present invention;
Fig. 3 is a flowchart of a method for generating deduplication data according to a second embodiment of the present invention;
Fig. 4 is a flowchart of a method for generating deduplication data according to a third embodiment of the present invention;
Fig. 5 is a block diagram of a deduplication data generation apparatus according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a method for generating deduplication data according to the first embodiment of the present invention. The method is suitable for performing data deduplication on a sorted B-tree and may be performed by a deduplication data generating device, which may be implemented in hardware and/or software.
Referring to fig. 1, the method for generating deduplication data includes:
S110, in the sorted B-tree, locating the first row of data that meets the starting condition in the screening condition according to the structure of the B-tree, and fetching that row.
A B-tree is a tree data structure that stores data in sorted order and allows lookups, insertions, and deletions to run in O(log n) time, as well as efficient sequential reads. A B-tree is, in general, a generalization of the binary search tree in which a node may have more than 2 child nodes. A 2-3 search tree is the special case of a B-tree of order 3; a B-tree of order M allows at most M child nodes, and therefore at most M-1 key values (keys), per node. The root node has at least two child nodes unless it is a leaf. Every other internal node has at least M/2 (rounded up) child nodes. The ordering of the B-tree is maintained by its insertion and deletion operations. The key values in each node of the sorted B-tree are typically arranged in ascending order, and the values in the child node between two adjacent key values lie between those two key values.
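To make the node layout concrete, the following sketch models a B-tree node with sorted keys and child pointers and performs a lookup by descending from the root. The class `Node`, the function `contains`, and the sample tree are hypothetical and are not taken from Fig. 2; insertion and deletion are omitted.

```python
from bisect import bisect_left

class Node:
    """A B-tree node: sorted keys and, for internal nodes, len(keys) + 1 children."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def contains(root, key):
    """Descend from the root, choosing one child per level; no full traversal."""
    node = root
    while True:
        i = bisect_left(node.keys, key)  # position of the first key >= search key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:  # reached a leaf without finding the key
            return False
        node = node.children[i]  # child i holds the values below keys[i]

# Hypothetical order-3 tree: root keys 3 and 6 separate three leaves.
root = Node([3, 6], [Node([1, 2]), Node([4, 5]), Node([7, 8])])
```

Each level discards all but one subtree, which is why a lookup costs time proportional to the height of the tree rather than to the number of rows.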
Since the deduplicated data should include all non-repeating data, the search needs to proceed from one side of the B-tree, usually starting from the smallest row of data; therefore, the row of data with the smallest key value needs to be located first.
Locating the first row of data satisfying the initial condition according to the structure of the B-tree may include: generating a search key value according to the initial condition, and locating the first row of data larger than the search key value according to the structure of the B-tree. In the sorted B-tree, the key values corresponding to all rows of data are arranged in order: rows whose key values are smaller than a key value of the root node are stored in the leaf node to its left, and rows whose key values are larger are stored in the leaf node to its right. Therefore, the corresponding leaf node can be located by using the structural characteristics of the B-tree without traversing all the leaf nodes.
The initial condition may be part of a filtering (screening) condition; for example, it may be the lower limit of the range set for the search key value. Fig. 2 is a structural diagram of a B-tree in the method for generating deduplication data according to an embodiment of the present invention. For the B-tree shown in Fig. 2, the filter condition may be, for example, greater than 3 or greater than 2.
Locating the first row of data larger than the search key value according to the structure of the B-tree may include: determining, according to the search key value, the leaf node in which the first row of data larger than the search key value is located; and locating the first row of data larger than the search key value in the page corresponding to that leaf node. The first row is selected because multiple rows of data may have key values larger than the search key value. These rows may have the same or different key values. Rows with different key values are outside the scope of the current lookup; rows with the same key value hold the same data as the first row, so only the first row needs to be selected.
After the first line of data satisfying the initial condition is located, it needs to be taken out to facilitate later generation of deduplication data.
S120, generating a search key value according to the key value of the fetched first row of data, locating the first row of data larger than the search key value according to the structure of the B-tree, and fetching that row.
Fetching the first row of data deduplicates only the rows corresponding to the first key value that meets the preset condition; deduplication must continue for the rows with other key values. Therefore, in this embodiment, a new search key value may be generated from the key value of the fetched first row of data. For example, the key value of the fetched first row may itself be used as the new search key value: if the key value of the fetched first row is 2, the new search key value is 2; if it is 3, the new search key value is 3.
After the new search key value is determined, the first row of data larger than the new search key value still needs to be located according to the structure of the B-tree and fetched. This can be implemented in the same way as the locating method described above: the new search key value is compared with the key values of the root node to determine the leaf node in which the new search key value is located, and the page of that leaf node is searched for data larger than the search key value. If such data exists, the first row of data larger than the search key value is located in that page; if not, the first row of data larger than the search key value is located in the page corresponding to the sibling node immediately to the right of the leaf node.
For a B-tree with little duplicate data, the page containing the key value of the fetched first row may be searched directly for data larger than the search key value. If such data exists, the first row of data larger than the search key value is located in that page; if not, it is located in the page corresponding to the sibling node immediately to the right of the leaf node. After the first row of data larger than the search key value is located, it is fetched.
If the key of a row consists of a plurality of values, the key value of the located first row is used as the search key value, and the row is fetched. For example, assuming that the rows with the original key value 1 in Fig. 2 have the key values (1, 1) and (1, 2), respectively, the first row (1, 1) is fetched and used as the new search key value; searching for the first row larger than (1, 1) finds the row with the key value (1, 2), and (1, 2) is then used as the new search key value.
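The multi-value key case can be illustrated with Python tuples, which compare lexicographically. In this sketch a sorted list stands in for the B-tree, and the sample keys and the helper `next_greater` are assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical composite keys in sorted order, with duplicates.
keys = [(1, 1), (1, 1), (1, 2), (1, 2), (2, 1)]

def next_greater(sorted_keys, search_key):
    """Return the first key strictly greater than search_key, or None."""
    i = bisect_right(sorted_keys, search_key)  # skips every copy of search_key
    return sorted_keys[i] if i < len(sorted_keys) else None
```

Starting from the search key (1, 1), the lookup lands on (1, 2), which then becomes the new search key, matching the example in the text.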
S130, generating a new search key value according to the key value of the most recently fetched first row of data, and returning to the operation of locating the first row of data larger than the new search key value according to the structure of the B-tree and fetching that row, until the new search key value is larger than the key value corresponding to the termination condition in the screening condition.
For example, the key value of the most recently fetched first row of data may be used as the new search key value; the first row of data larger than the new search key value is then located according to the structure of the B-tree and fetched. This operation is repeated, each time generating a new search key value from the most recently fetched first row, until the generated search key value is larger than the key value corresponding to the termination condition in the screening condition. That key value may be the upper-limit key value of the range given in the screening condition. For example, if the filtering condition is greater than 2 and less than 4, the key value of the termination condition is 3; when the newly generated search key value is 4, the operation of locating and fetching the first row of data larger than the new search key value ends.
S140, generating the deduplication data according to all the fetched first rows of data.
Since each first row of data fetched in the above operations differs from every other fetched row, fetching them is equivalent to deduplicating the rows that hold identical data. Therefore, the deduplication data may be generated from all the fetched first rows of data; for example, all the fetched first rows may be combined to form the deduplication data of the B-tree.
This embodiment locates the first row of data satisfying the initial condition by using the structural features of the sorted B-tree. The search key value is then replaced iteratively according to each first row of data obtained by locating, until the newly generated search key value is larger than the key value corresponding to the termination condition in the screening condition. The deduplication data is generated from the fetched first rows of data. Non-repeating data can thus be located using the structural features of the sorted B-tree without traversing all rows of data in the B-tree structure. This reduces the amount of data processed and therefore the deduplication processing time.
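The loop of S110 to S140 can be sketched end to end as follows. This is a simplified illustration under stated assumptions: a sorted Python list stands in for the sorted B-tree, `bisect_right` stands in for the structural lookup of the first row larger than the search key value, and the names `distinct_keys`, `lower`, and `upper` are hypothetical.

```python
from bisect import bisect_right

def distinct_keys(sorted_keys, lower=None, upper=None):
    """Return the distinct keys k with lower < k <= upper (both bounds optional)."""
    result = []
    # S110: locate the first row that satisfies the starting condition.
    pos = 0 if lower is None else bisect_right(sorted_keys, lower)
    while pos < len(sorted_keys):
        key = sorted_keys[pos]
        # Termination: stop once the key passes the upper-limit key value.
        if upper is not None and key > upper:
            break
        result.append(key)  # fetch the located first row
        # S120/S130: the fetched key becomes the new search key; jump directly
        # to the first row larger than it instead of scanning its duplicates.
        pos = bisect_right(sorted_keys, key, pos)
    # S140: the fetched first rows are the deduplication data.
    return result
```

For the keys 1, 1, 1, 2, 2, 3, 4, 4, 5 with the condition greater than 2 and less than 4 (so `lower=2`, `upper=3`), the loop fetches 3, then sees 4 and stops. The number of lookups grows with the number of distinct keys rather than with the number of rows, which is where the saving comes from when duplicates are frequent.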
In a preferred implementation of this embodiment, the condition "until the new search key value is larger than the key value corresponding to the termination condition in the screening condition" may be specifically optimized as follows: when the termination condition is absent (defaulted), the loop continues until no row of data larger than the new search key value exists in the B-tree. If there is no termination condition, the rows up to and including the maximum key value need to be accessed for deduplication. Therefore, when no row of data in the B-tree is larger than the new search key value, the most recently fetched row is the row with the maximum key value, and the search for new rows may be terminated. This avoids missing data.
Example two
Fig. 3 is a schematic flowchart of a method for generating deduplication data according to a second embodiment of the present invention. In this embodiment, locating the first row of data larger than the search key value according to the structure of the B-tree is specifically optimized as follows: determining the leaf node in which the new search key value is located; searching the page corresponding to that leaf node for the first row of data whose key value is larger than the search key value; if such a row is found, locating it in the page of the leaf node; otherwise, locating the first row of data larger than the search key value in the page corresponding to the sibling node immediately to the right of the leaf node.
Referring to fig. 3, the method for generating deduplication data includes:
S210, in the sorted B-tree, locating the first row of data that meets the starting condition in the screening condition according to the structure of the B-tree, and fetching that row.
S220, generating a new search key value according to the key value of the fetched first row of data.
S230, determining the leaf node in which the new search key value is located.
For example, the search key value may be compared with the key values of the root node of the B-tree to determine the leaf node in which the new search key value is located. Still taking the B-tree shown in Fig. 2 as an example, when the new search key value is 3, it is compared with the key values of the root node. Since the key values of the root node are 3 and 5, the new search key value equals a key value of the root node, and the B-tree is sorted, the leaf node in which the new search key value is located is the middle leaf node under the root node. If the new search key value is 2, since 2 is less than 3, the leaf node in which the new search key value is located is the left leaf node under the root node.
S240, searching the first row of data with the key value larger than the new search key value in the page corresponding to the leaf node, and if the first row of data with the key value larger than the new search key value is searched, positioning the first row of data with the key value larger than the new search key value in the page of the leaf node.
For a B-tree structure with relatively little duplicate data, the rows of data corresponding to adjacent key values are typically stored in the same page. Therefore, after the page containing the new search key value is determined, that page can be searched for a row of data larger than the new search key value. If one exists, the first row of data larger than the new search key value is located in the page of the leaf node.
Still taking the B-tree in Fig. 2 as an example, if the search key value is 3 and the key value 4 is found in the page corresponding to the middle leaf node, the first row of data larger than the search key value 3 is the row corresponding to the key value 4.
Preferably, searching the page corresponding to the leaf node for the first row of data whose key value is larger than the search key value may include: locating that row in the page by intra-page binary search. Binary search, also called halving search, works as follows: the elements are stored in an array in ascending order of key; the given key is first compared with the key of the element at the middle position; if they are equal, the search succeeds; otherwise, if the given key is smaller, the search continues in the front half, and if it is larger, in the rear half. Each comparison thus halves the search interval, and the process continues until the search succeeds or fails. Binary search is an efficient lookup method, but it requires the elements to be sorted by key; otherwise the search will not find the correct row. Still taking the B-tree shown in Fig. 2 as an example, if the new search key value is 3 and the key value in the middle of the page corresponding to the middle leaf node is also 3, the search continues in the right half of the page and finds the key value 4, so the first row of data larger than the new search key value 3 is the row corresponding to the key value 4.
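The intra-page binary search can be sketched as an upper-bound search, that is, a binary search that returns the position of the first key strictly greater than the search key value instead of an exact match. The function name `upper_bound` and the representation of a page as a sorted list of keys are assumptions made for the example.

```python
def upper_bound(page_keys, search_key):
    """Index of the first key in the sorted page that is > search_key.

    Returns len(page_keys) when no such key exists in the page.
    """
    lo, hi = 0, len(page_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if page_keys[mid] <= search_key:
            lo = mid + 1  # the middle key is not larger: continue in the right half
        else:
            hi = mid      # the middle key is larger: it or something left of it wins
    return lo
```

A result equal to the page length signals that the page holds no larger key, which is the case in which the lookup has to move on to the right sibling.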
S250, otherwise, locating the first row of data larger than the new search key value in the page corresponding to the sibling node immediately to the right of the leaf node.
If the page containing the new search key value holds a lot of duplicate data, the page may contain no data larger than the new search key value. In that case, the first row of data larger than the search key value needs to be located in the page corresponding to the sibling node immediately to the right of the leaf node. Still taking the B-tree in Fig. 2 as an example, if the new search key value is 2, it is compared with the key values of the root node; since the key values of the root node are 3 and 5 and the B-tree is sorted, the new search key value 2 is located in the left leaf node under the root node. The page corresponding to the left leaf node is searched for the first row of data larger than 2; since the left leaf node contains no data larger than 2, the search moves to the leaf node adjacent to its right, namely the middle leaf node. The page corresponding to the middle leaf node is searched for the first row of data whose key value is larger than the search key value, which locates the first row of data larger than the search key value 2.
S260, returning to the operation of generating a new search key value according to the key value of the fetched first row of data, locating the first row of data larger than the new search key value according to the structure of the B-tree, and fetching that row, until the new search key value is larger than the key value corresponding to the termination condition in the screening condition.
S270, generating the deduplication data according to all the fetched first rows of data.
In this embodiment, locating the first row of data larger than the search key value according to the structure of the B-tree is specifically optimized as follows: determining the leaf node in which the new search key value is located; searching the page corresponding to that leaf node for the first row of data whose key value is larger than the new search key value; if such a row is found, locating it in the page of the leaf node; otherwise, locating the first row of data larger than the new search key value in the page corresponding to the sibling node immediately to the right of the leaf node. The storage location of the data in the sorted B-tree can thus be used to quickly locate the first row of data larger than the newly generated search key value.
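The leaf lookup with the right-sibling fallback can be sketched for a two-level tree. The layout below (root keys 3 and 5 over three leaf pages) is an assumed reading of the Fig. 2 example, since the figure itself is not reproduced here; the names `find_leaf` and `next_greater` are hypothetical. The loop keeps moving right rather than checking only one sibling, a defensive choice for the case where a key's duplicates fill more than one page.

```python
from bisect import bisect_right

def find_leaf(root_keys, search_key):
    """S230: index of the leaf page whose key range contains search_key."""
    return bisect_right(root_keys, search_key)

def next_greater(root_keys, leaves, search_key):
    """Return (leaf index, row index) of the first row > search_key, or None."""
    leaf = find_leaf(root_keys, search_key)
    while leaf < len(leaves):
        page = leaves[leaf]
        i = bisect_right(page, search_key)  # S240: search inside the page
        if i < len(page):
            return leaf, i
        leaf += 1  # S250: nothing larger in this page, try the right sibling
    return None  # no row larger than search_key exists in the tree

# Assumed layout: left leaf holds keys below 3, middle leaf 3 and 4, right leaf 5.
root_keys = [3, 5]
leaves = [[1, 1, 2], [3, 3, 4], [5, 5]]
```

With this layout, the search key 3 resolves inside the middle page (the row with key 4), while the search key 2 finds nothing larger in the left page and falls through to the first row of the middle page.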
Example three
Fig. 4 is a schematic flowchart of a method for generating deduplication data according to a third embodiment of the present invention. In this embodiment, locating the first row of data that meets the starting condition in the screening condition according to the structure of the B-tree is specifically optimized as follows: when the starting condition is absent (defaulted), locating the first row of data of the leftmost leaf node of the B-tree.
Referring to fig. 4, the method for generating deduplication data includes:
S310, in the sorted B-tree, when the starting condition is absent, locating the first row of data of the leftmost leaf node of the B-tree, and fetching that row.
In some cases, no range is set for deduplication in the B-tree; this is commonly referred to as the starting condition being defaulted. When the starting condition is defaulted, all rows of data in the B-tree need to be deduplicated. In the B-tree, if the left sub-tree of a node is not empty, the values of all nodes in that left sub-tree are smaller than the value of the node. It can therefore be determined that the smallest row of data is located in the leftmost leaf node. Moreover, within the page corresponding to a leaf node, the rows of data are also arranged in ascending order, so the first row of data of the leftmost leaf node of the B-tree is the minimum data. Thus, the first row of data of the leftmost leaf node of the B-tree is located and fetched.
S320, generating a new search key value according to the key value of the extracted first row of data, positioning the first row of data larger than the search key value according to the structure of the B-tree, and extracting the first row of data.
S330, returning to execute the operations of generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the new search key value according to the structure of the B-tree, and taking out the first row of data, until the new search key value is larger than the key value corresponding to the termination condition in the screening condition.
S340, generating deduplication data according to all the extracted first row data.
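Steps S310 to S340 can be sketched end to end as follows. This is a simplified model under stated assumptions, not the patented implementation: the leaf level of the sorted B-tree is modelled as a list of non-empty sorted Python lists (`pages`), rows are represented by their key values, the new search key value is assumed to be the key of the row just taken out, and the names `locate_greater`, `distinct_keys` and `end_key` are hypothetical.

```python
from bisect import bisect_right


def locate_greater(pages, search_key):
    """Return the first key greater than search_key, or None if there is none.

    `pages` is a list of non-empty sorted lists standing in for the leaf
    pages of a sorted B-tree, ordered left to right.
    """
    # Stand-in for the tree descent: the first page whose last key exceeds
    # search_key must hold the answer. A real B-tree would reach that page
    # through its internal nodes instead of building this list.
    last_keys = [page[-1] for page in pages]
    page_no = bisect_right(last_keys, search_key)
    if page_no == len(pages):
        return None
    page = pages[page_no]
    return page[bisect_right(page, search_key)]


def distinct_keys(pages, end_key=None):
    """Collect one row per distinct key, with the starting condition omitted."""
    result = []
    if not pages:
        return result
    # S310: take the first row of the leftmost leaf.
    row = pages[0][0]
    while row is not None:
        # S330 stop test: the new search key exceeds the termination key.
        if end_key is not None and row > end_key:
            break
        result.append(row)
        # S320: the key just taken out becomes the new search key (assumption).
        row = locate_greater(pages, row)
    # S340: the rows taken out form the deduplicated data.
    return result
```

With `pages = [[1, 1, 2], [2, 2, 5], [5, 7, 7]]` the loop takes out four rows instead of visiting all nine. Passing `end_key=None` models an omitted termination condition, in which case the loop runs until no larger row exists.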
In this embodiment, positioning the first row of data that satisfies the starting condition in the screening condition according to the structure of the B-tree is specifically optimized as follows: when the starting condition is omitted, positioning the first row of data of the leftmost leaf node of the B-tree. Thus, when the starting condition is omitted, the position of the minimum data row can be located accurately.
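The leftmost-leaf positioning used when the starting condition is omitted can be sketched as a simple descent. The `Node` class, with its `rows` and `children` fields, is a hypothetical in-memory stand-in for B-tree pages and is not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical B-tree node: internal nodes have children, leaves have rows."""
    rows: list = field(default_factory=list)
    children: list = field(default_factory=list)


def first_row_of_leftmost_leaf(root: Node):
    """Follow the leftmost child pointer down to a leaf and return its first row."""
    node = root
    while node.children:
        # The leftmost subtree holds the smallest keys of the whole tree.
        node = node.children[0]
    # Rows inside a leaf page are sorted, so the first row is the minimum.
    return node.rows[0] if node.rows else None
```

The descent touches one node per tree level, so the minimum row is reached without comparing any key values.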
Example four
Fig. 5 is a structural diagram of an apparatus for generating deduplication data according to a fourth embodiment of the present invention, and as shown in fig. 5, the apparatus includes:
a starting positioning module 410, configured to position, in the sorted B-tree, a first row of data that meets a starting condition in a screening condition according to a structure of the B-tree, and take out the first row of data;
an intermediate positioning module 420, configured to generate a new search key value according to the key value of the taken first row of data, position the first row of data larger than the new search key value according to the structure of the B-tree, and take out the first row of data;
a cyclic positioning module 430, configured to return to execute the operations of generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the new search key value according to the structure of the B-tree, and taking out the first row of data, until the new search key value is larger than the key value corresponding to the termination condition in the screening condition;
a generating module 440, configured to generate deduplication data according to all the taken first row data.
The device for generating deduplication data according to this embodiment positions the first row of data that satisfies the starting condition by using the structural features of the sorted B-tree, iteratively replaces the search key value according to each first row of data obtained by positioning until the newly generated search key value is larger than the key value corresponding to the termination condition in the screening condition, and generates deduplication data from the first rows of data taken out. Non-repeating data can thus be located using the structural features of the sorted B-tree without traversing all the rows of data in the B-tree. This reduces the amount of data processed and, in turn, the deduplication processing time.
On the basis of the above embodiments, the intermediate positioning module includes:
a leaf node determining unit, configured to determine a leaf node where the new lookup key value is located;
a locating unit, configured to search for a first row of data with a key value larger than the new search key value in a page corresponding to the leaf node, and if the first row of data with the key value larger than the new search key value is found, locate the first row of data with the key value larger than the new search key value in the page of the leaf node;
otherwise, position the first row of data larger than the new search key value in the page corresponding to the sibling node immediately to the right of the leaf node.
On the basis of the above embodiments, the positioning unit includes:
a positioning subunit, configured to position the first row of data larger than the new search key value in the page by intra-page binary search (dichotomy).
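The intra-page binary search performed by the positioning subunit can be sketched as an upper-bound search over a sorted page. The function name `first_greater_in_page` and the list-of-keys page model are assumptions made for illustration only.

```python
def first_greater_in_page(page, search_key):
    """Return the index of the first key in `page` greater than search_key.

    `page` is a sorted list of keys. The result equals len(page) when no
    key in the page is greater, which signals a move to the right sibling.
    """
    lo, hi = 0, len(page)
    while lo < hi:
        mid = (lo + hi) // 2
        if page[mid] <= search_key:
            lo = mid + 1  # the answer lies strictly to the right of mid
        else:
            hi = mid      # page[mid] is a candidate; keep it in range
    return lo
```

The function returns the same index as `bisect.bisect_right` from the Python standard library; it is written out here only to make the halving step visible.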
On the basis of the above embodiments, the intermediate positioning module includes:
a positioning unit, configured to generate a search key value according to the starting condition and position the first row of data larger than the search key value according to the structure of the B-tree.
On the basis of the foregoing embodiments, the start positioning module includes:
a starting default positioning unit, configured to position the first row of data of the leftmost leaf node of the B-tree when the starting condition is omitted.
On the basis of the above embodiments, the cyclic positioning module includes:
a default termination unit, configured to, when the termination condition is omitted, continue until no row data larger than the new search key value exists in the B-tree.
The device for generating the deduplication data provided by the embodiment of the invention can execute the method for generating the deduplication data provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented by the apparatus described above. Alternatively, the embodiments of the present invention may be implemented as programs executable by a computer device, so that they can be stored in a storage device and executed by a processor; the programs may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disk. The modules or steps may also be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (8)

1. A method for generating deduplication data, comprising:
in the sorted B-tree, positioning a first row of data meeting the starting condition in the screening condition according to the structure of the B-tree, and taking out the first row of data larger than the search key value;
generating a new search key value according to the key value of the extracted first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and extracting the first row of data larger than the new search key value;
returning to execute the operation of generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and taking out the first row of data larger than the new search key value until the new search key value is larger than the key value corresponding to the termination condition in the screening condition;
generating deduplication data according to all the taken first row data;
wherein the locating the first row of data larger than the new lookup key value according to the structure of the B-tree comprises:
determining the leaf node where the new search key value is located;
searching a first row of data with key values larger than the new search key value in the page corresponding to the leaf node, and if the first row of data with key values larger than the new search key value is searched, positioning the first row of data with key values larger than the new search key value in the page of the leaf node;
otherwise, positioning the first row of data larger than the new search key value in the page corresponding to the sibling node immediately to the right of the leaf node.
2. The method of claim 1, wherein the searching for the first row of data with a key value larger than the new search key value in the page corresponding to the leaf node comprises:
positioning the first row of data larger than the new search key value in the page by intra-page binary search (dichotomy).
3. The method of claim 1, wherein said locating the first row of data satisfying a starting condition in a filtering condition according to the structure of the B-tree comprises:
when the starting condition is omitted, positioning the first row of data of the leftmost leaf node of the B-tree.
4. The method of claim 1, wherein the returning until the new search key value is larger than the key value corresponding to the termination condition in the screening condition comprises:
when the termination condition is omitted, continuing until no row data larger than the new search key value exists in the B-tree.
5. The method of claim 1, wherein said locating the first row of data satisfying a starting condition in a filtering condition according to the structure of the B-tree comprises:
generating a search key value according to the starting condition, and positioning the first row of data larger than the search key value according to the structure of the B-tree.
6. An apparatus for generating deduplication data, comprising:
a starting positioning module, configured to position, in the sorted B-tree, a first row of data meeting the starting condition in the screening condition according to the structure of the B-tree, and take out the first row of data larger than the search key value;
an intermediate positioning module, configured to generate a new search key value according to the key value of the taken first row of data, position the first row of data larger than the new search key value according to the structure of the B-tree, and take out the first row of data;
a cyclic positioning module, configured to return to execute the operations of generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the new search key value according to the structure of the B-tree, and taking out the first row of data larger than the new search key value, until the new search key value is larger than the key value corresponding to the termination condition in the screening condition;
a generating module, configured to generate deduplication data according to all the taken first row data;
wherein the intermediate positioning module comprises:
a leaf node determining unit, configured to determine a leaf node where the new lookup key value is located;
a locating unit, configured to search for a first row of data with a key value larger than the new search key value in a page corresponding to the leaf node, and if the first row of data with the key value larger than the new search key value is found, locate the first row of data with the key value larger than the new search key value in the page of the leaf node;
otherwise, position the first row of data larger than the new search key value in the page corresponding to the sibling node immediately to the right of the leaf node.
7. The apparatus of claim 6, wherein the positioning unit comprises:
a positioning subunit, configured to position the first row of data larger than the new search key value in the page by intra-page binary search (dichotomy).
8. The apparatus of claim 6, wherein the intermediate positioning module comprises:
a positioning unit, configured to generate a search key value according to the starting condition and position the first row of data larger than the search key value according to the structure of the B-tree.
CN201711336936.1A 2017-12-14 2017-12-14 Method and device for generating deduplication data Active CN107944038B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711336936.1A CN107944038B (en) 2017-12-14 2017-12-14 Method and device for generating deduplication data

Publications (2)

Publication Number Publication Date
CN107944038A CN107944038A (en) 2018-04-20
CN107944038B CN107944038B (en) 2020-11-10

Family

ID=61943242

Country Status (1)

Country Link
CN (1) CN107944038B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant