CN107944038B

CN107944038B - Method and device for generating deduplication data

Info

Publication number: CN107944038B
Application number: CN201711336936.1A
Authority: CN
Inventors: 张钦; 张黎敏; 朱仲颖
Original assignee: Shanghai Dameng Database Co Ltd
Current assignee: Shanghai Dameng Database Co Ltd
Priority date: 2017-12-14
Filing date: 2017-12-14
Publication date: 2020-11-10
Anticipated expiration: 2037-12-14
Also published as: CN107944038A

Abstract

The invention discloses a method and a device for generating deduplication data. The method comprises the following steps: in a sorted B-tree, locating the first row of data that satisfies an initial condition according to the structure of the B-tree, and fetching that row; generating a search key value from the fetched row, locating the first row of data larger than the search key value according to the structure of the B-tree, and fetching that row; returning to the step of generating a search key value from the most recently fetched row, until no data larger than the search key value exists in the B-tree; and generating the deduplication data from all of the fetched rows. Non-duplicate data can thus be located using the structural features of the sorted B-tree without traversing all the rows of data in the B-tree. The amount of data processed is reduced, and the deduplication processing time is reduced accordingly.

Description

Method and device for generating deduplication data

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for generating deduplication data.

Background

Data deduplication is a data compression technique used to eliminate redundant data. In a typical deduplication process, the first data is compared to the stored data to detect duplicates, i.e., to identify or determine whether the first data is unique. Then, when the first data is identified as non-unique, the redundant first data is eliminated and replaced with a small reference pointing to the stored data.

Currently, the following algorithm can be adopted for data deduplication. Step one: acquire the first row of data and place it in a temporary cache region. Step two: when a new row of data exists, compare the new row with the rows in the cache region one by one; if an identical row is found, discard the new row; if not, put the new row into the cache region. Step three: attempt to acquire a new row of data; if one is acquired, repeat step two; otherwise, the data in the cache region is the deduplicated data.

The above algorithm has the following problems: if the amount of non-duplicate data is large, a large temporary space is required to store it. Furthermore, all of the data to be processed must be traversed, so the method is time-consuming when the data to be processed is large.
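The cache-based algorithm described above can be sketched as follows. This is only an illustration of the prior-art procedure; the function name `dedup_with_cache` and the sample rows are assumptions introduced here, not part of the original text.

```python
def dedup_with_cache(rows):
    """Deduplicate rows by comparing each new row with a cache, one by one."""
    cache = []  # temporary cache region holding the distinct rows seen so far
    for row in rows:
        # Step two: compare the new row with every cached row.
        if any(row == kept for kept in cache):
            continue  # an identical row is already cached, discard the new row
        cache.append(row)
    # Step three ended: no more rows, the cache is the deduplicated data.
    return cache


print(dedup_with_cache([3, 1, 3, 2, 1]))  # [3, 1, 2]
```

Every input row is visited and the cache grows with the number of distinct rows, which corresponds to the two problems stated above.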

Another algorithm sorts the data in advance and then deduplicates the sorted data. Step one: output the first row of data directly and keep it as the retained row. Step two: acquire a new row; if the new row is the same as the retained row, discard the new row; if the new row is different from the retained row, output the new row and set it as the retained row. Step three: attempt to acquire a new row; if one is acquired, repeat step two; otherwise, end the deduplication operation. The advantage of this algorithm is that it does not require a large temporary space, but it still requires traversal of all the data to be processed and is therefore also very time-consuming.
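A minimal sketch of this sort-based procedure is given below, assuming the input is already sorted. The name `dedup_sorted` is hypothetical and chosen only for illustration.

```python
def dedup_sorted(sorted_rows):
    """Deduplicate already-sorted rows while keeping a single retained row."""
    output = []
    has_retained = False
    retained = None
    for row in sorted_rows:
        # Step two: a row equal to the retained row is a duplicate.
        if has_retained and row == retained:
            continue
        output.append(row)   # output the new row
        retained = row       # and make it the retained row
        has_retained = True
    return output


print(dedup_sorted([1, 1, 2, 2, 2, 3]))  # [1, 2, 3]
```

Only one retained row is kept in memory, but the loop still visits every input row.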

Disclosure of Invention

The embodiment of the invention provides a method and a device for generating deduplication data, which are used for solving the technical problem that, in the prior art, all of the data to be processed must be traversed in order to deduplicate it.

In a first aspect, an embodiment of the present invention provides a method for generating deduplication data, including:

in the sorted B tree, positioning a first row of data meeting initial conditions according to the structure of the B tree, and taking out the first row of data;

generating a search key value according to the extracted first row of data, positioning the first row of data larger than the search key value according to the structure of the B tree, and extracting the first row of data;

returning to generate a search key value according to the extracted first row of data until no data larger than the search key value exists in the B tree;

and generating deduplication data according to the extracted first line data.

In a second aspect, an embodiment of the present invention further provides an apparatus for generating deduplication data, including:

the starting positioning module is used for positioning a first row of data meeting the starting condition in the screening condition according to the structure of the B tree in the sorted B tree and taking out the first row of data;

the middle positioning module is used for generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the search key value according to the structure of the B tree, and taking out the first row of data;

the circular positioning module is used for returning and executing operations of generating a new search key value according to the key value of the extracted first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and extracting the first row of data until the new search key value is larger than the key value corresponding to the termination condition in the screening condition;

and the generating module is used for generating the deduplication data according to all the taken first line data.

According to the method and the device for generating deduplication data provided by the embodiment of the invention, the first row of data satisfying the initial condition is located by using the structural characteristics of the sorted B-tree. The search key value is then iteratively replaced according to each first row of data obtained by locating, until the generated new search key value is larger than the key value corresponding to the termination condition in the screening condition. The deduplication data is generated from the fetched first rows of data. Non-duplicate data can thus be located using the structural features of the sorted B-tree without traversing all the rows of data in the B-tree, which reduces the amount of data processed and, in turn, the deduplication processing time.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:

fig. 1 is a flowchart of a method for generating deduplication data according to an embodiment of the present invention;

fig. 2 is a structural diagram of a B-tree in the method for generating deduplication data according to an embodiment of the present invention;

fig. 3 is a flowchart of a method for generating deduplication data according to a second embodiment of the present invention;

fig. 4 is a flowchart of a method for generating deduplication data according to a third embodiment of the present invention;

fig. 5 is a block diagram of a deduplication data generation apparatus according to a fourth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart illustrating a method for generating deduplication data according to an embodiment of the present invention. The method is applicable to performing data deduplication on a sorted B-tree, and may be performed by a deduplication data generating device, which may be implemented in hardware and/or software.

Referring to fig. 1, the method for generating deduplication data includes:

S110, in the sorted B-tree, locating the first row of data meeting the starting condition in the screening condition according to the structure of the B-tree, and fetching the first row of data.

A B-tree is a tree data structure that stores data in sorted order and allows lookups, sequential access, insertions, and deletions to run with O(log n) time complexity. In general, a B-tree is a generalization of the binary search tree in which a node may have more than two child nodes. A 2-3 search tree is the special case of a B-tree of order 3; more generally, a B-tree of order M allows each node at most M child nodes and at most M-1 key values (keys). The root node, unless it is a leaf, has at least two child nodes, and every other non-leaf node has at least M/2 child nodes (rounded up). The ordering of the B-tree is maintained by its insertion and deletion operations. The keys within each node of the sorted B-tree are typically arranged in ascending order, and the values in the child node between two adjacent keys lie between the values of those two keys.
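As a small illustration of how the ascending keys of one node route a lookup to a child, consider the sketch below. The helper name `child_index` is hypothetical, and the rule that a key equal to a separator is routed to the right of it is an assumption, chosen because it matches the later description of FIG. 2 (root keys 3 and 5, with key 3 found in the middle leaf).

```python
from bisect import bisect_right


def child_index(separator_keys, key):
    """Index of the child subtree that may contain `key`.

    Assumption: a key equal to a separator is routed to the child on the
    right of that separator.
    """
    return bisect_right(separator_keys, key)


root_keys = [3, 5]  # a node with M - 1 = 2 keys and M = 3 children
for key in (2, 3, 4, 5):
    print(key, "->", child_index(root_keys, key))
```

Because the keys inside a node are sorted, the child is selected by a search within one node rather than by visiting every child.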

Since the deduplicated data should include all non-duplicate data, the search needs to proceed from one side of the B-tree, usually starting from the smallest row of data. Therefore, the row of data with the smallest key value needs to be located first.

Locating the first row of data satisfying the initial condition according to the structure of the B-tree may include: generating a search key value according to the initial condition, and locating the first row of data larger than the search key value according to the structure of the B-tree. In the sorted B-tree, the key values corresponding to all rows of data are arranged in order: rows whose key values are smaller than a key value of the root node are stored in the leaf nodes to its left, and rows whose key values are larger are stored in the leaf nodes to its right. Therefore, the corresponding leaf node can be located by using the structural characteristics of the B-tree, without traversing all the leaf nodes.

The initial condition may be the starting condition of a screening (filter) condition, for example a lower limit of the range of key values to be searched. Fig. 2 is a structural diagram of a B-tree in the method for generating deduplication data according to an embodiment of the present invention. For example, for the B-tree shown in FIG. 2, the screening condition may be "greater than 3", "greater than 2", or another screening condition.

Locating the first row of data larger than the search key value according to the structure of the B-tree may include: determining, according to the search key value, the leaf node in which the first row of data larger than the search key value is located; and locating that row in the page corresponding to the leaf node. Only the first row is selected because multiple rows of data may have key values larger than the search key value. These rows may have the same key value or different key values. Rows with a different, larger key value are outside the scope of the current lookup; rows with the same key value contain the same data as the first row, so only the first row needs to be selected.
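The reason why only the first larger row is needed can be illustrated with a sorted list standing in for the B-tree. This is a simplification and not the patented structure itself: the function name `first_row_greater`, the (key, data) row layout, and the sample rows are assumptions, and the key list is rebuilt on every call only to keep the example short.

```python
from bisect import bisect_right


def first_row_greater(sorted_rows, search_key):
    """Return the first row whose key is larger than `search_key`, or None."""
    keys = [row[0] for row in sorted_rows]  # illustration only, not efficient
    pos = bisect_right(keys, search_key)    # skips every row with key <= search_key
    return sorted_rows[pos] if pos < len(sorted_rows) else None


# Rows are (key, data); rows sharing a key carry the same data.
rows = [(1, "a"), (1, "a"), (2, "b"), (2, "b"), (2, "b"), (4, "d")]
print(first_row_greater(rows, 1))  # (2, 'b')
print(first_row_greater(rows, 2))  # (4, 'd')
print(first_row_greater(rows, 4))  # None
```

All rows that duplicate the search key are skipped in one step, and the duplicates of the returned row are skipped by the next lookup.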

After the first line of data satisfying the initial condition is located, it needs to be taken out to facilitate later generation of deduplication data.

And S120, generating a search key value according to the key value of the taken first row of data, positioning the first row of data larger than the search key value according to the structure of the B tree, and taking out the first row of data.

Fetching the first row of data deduplicates only the rows corresponding to the first key value that meets the preset condition; deduplication must continue for the rows of the other key values. Therefore, in this embodiment, a new search key value may be generated from the key value of the fetched first row of data. For example, the key value of the fetched first row may itself be used as the new search key value: if the key value of the fetched first row is 2, the new search key value is 2; if it is 3, the new search key value is 3.

After the new search key value is determined, the first row of data larger than the new search key value is again located according to the structure of the B-tree and fetched. The implementation may be the same as the locating method described above: the new search key value is compared with the key values of the root node to determine the leaf node in which the new search key value is located, and the page of that leaf node is searched for data larger than the search key value. If such data exists, the first row of data larger than the search key value is located in that page; if not, the first row of data larger than the search key value is located in the page corresponding to the sibling node immediately to the right of the leaf node.

For a B-tree with little duplicate data, the page containing the key value of the fetched first row may be searched directly for data larger than the search key value. If such data exists, the first row of data larger than the search key value is located in that page; if not, the first row of data larger than the search key value is located in the page corresponding to the sibling node immediately to the right of the leaf node. After the first row of data larger than the search key value is located, it is fetched.

If a row of data is identified by a plurality of key values (a composite key), the key values of the located first row are together used as the search key value, and the first row is fetched. For example, assuming that the rows with the original key value 1 in FIG. 2 have the key values (1,1) and (1,2) respectively, the first row (1,1) is fetched and used as the new search key value; searching for the first row greater than (1,1) finds the row with key value (1,2), and (1,2) is then used as the new search key value.
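The composite-key case can be sketched with Python tuples, which compare column by column. The sorted list again stands in for the B-tree, and the function name `distinct_composite_keys` and the sample keys are assumptions made for illustration.

```python
from bisect import bisect_right


def distinct_composite_keys(sorted_keys):
    """Collect each distinct composite key once by repeated 'first greater' lookups."""
    if not sorted_keys:
        return []
    search_key = sorted_keys[0]      # first row, used as the first search key
    fetched = [search_key]
    while True:
        pos = bisect_right(sorted_keys, search_key)  # first key greater than search_key
        if pos == len(sorted_keys):
            return fetched
        search_key = sorted_keys[pos]                # e.g. (1, 1) -> (1, 2)
        fetched.append(search_key)


keys = [(1, 1), (1, 1), (1, 2), (1, 2), (2, 1)]
print(distinct_composite_keys(keys))  # [(1, 1), (1, 2), (2, 1)]
```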

S130, generating a new search key value according to the key value of the first row of data which is taken out recently, returning to execute the operation of positioning the first row of data which is larger than the new search key value according to the structure of the B tree and taking out the first row of data until the new search key value is larger than the key value corresponding to the termination condition in the screening condition.

For example, the key value of the most recently fetched first row of data may be used as the new search key value; the first row of data larger than the new search key value is then located according to the structure of the B-tree and fetched. These operations are repeated, each time generating a new search key value from the most recently fetched first row, until the generated new search key value is larger than the key value corresponding to the termination condition in the screening condition. The key value corresponding to the termination condition may be the upper-limit key value of the range given in the screening condition. For example, if the screening condition is "greater than 2 and less than 4", the key value of the termination condition is 3; when the search key value generated by the iteration is 4, the operation of locating and fetching the first row of data larger than the new search key value ends.

And S140, generating deduplication data according to all the extracted first line data.

Since each first row of data fetched in the above operations differs from every other fetched row, fetching them is equivalent to deduplicating the rows that share the same data. Therefore, the deduplication data may be generated from all the fetched first rows of data. For example, all the fetched first rows may be combined to generate the deduplication data of the B-tree.
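Steps S110 to S140 can be sketched end to end as follows. A sorted Python list with `bisect_right` is used here as a stand-in for the B-tree lookup, which is a simplification; the function name `distinct_in_range`, the parameters `start_key` (exclusive lower bound) and `end_key` (key value of the termination condition), and the sample keys are all assumptions introduced for illustration.

```python
from bisect import bisect_right


def distinct_in_range(sorted_keys, start_key, end_key):
    """Fetch one row per distinct key with start_key < key <= end_key."""
    fetched = []
    search_key = start_key                        # S110: search key from the starting condition
    while True:
        pos = bisect_right(sorted_keys, search_key)  # first row greater than the search key
        if pos == len(sorted_keys):
            break                                 # nothing larger is left
        search_key = sorted_keys[pos]             # S120/S130: fetched key becomes the new search key
        if search_key > end_key:
            break                                 # new search key exceeds the termination key
        fetched.append(search_key)
    return fetched                                # S140: deduplication data


keys = [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5]
print(distinct_in_range(keys, 2, 3))  # [3]   ("greater than 2 and less than 4")
print(distinct_in_range(keys, 0, 4))  # [1, 2, 3, 4]
```

The loop performs one lookup per distinct key rather than one comparison per row.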

The present embodiment locates the first row of data satisfying the initial condition by using the structural features of the sorted B-tree. The search key value is then iteratively replaced according to each first row of data obtained by locating, until the generated new search key value is larger than the key value corresponding to the termination condition in the screening condition. The deduplication data is generated from the fetched first rows of data. Non-duplicate data can thus be located using the structural features of the sorted B-tree without traversing all the rows of data in the B-tree, which reduces the amount of data processed and, in turn, the deduplication processing time.
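As a rough, hedged estimate of the saving (not stated in the original text), let N be the number of rows in the B-tree and D the number of distinct key values in the requested range, and assume each locating operation costs O(log N) as for an ordinary B-tree lookup. The method performs about D + 1 locating operations, the last of which finds nothing or exceeds the termination key:

```latex
C_{\text{scan}} = O(N), \qquad
C_{\text{locate}} = O\big((D + 1)\,\log N\big)
```

Under these assumptions the method is advantageous when D log N is much smaller than N, that is, when the data contains many duplicates; when almost every row is distinct, a sequential scan of the sorted rows may be cheaper.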

In a preferred implementation of this embodiment, the condition "until the new search key value is larger than the key value corresponding to the termination condition in the screening condition" may be specifically optimized as: when the termination condition is absent (defaulted), until no row of data larger than the new search key value exists in the B-tree. If there is no termination condition, the data corresponding to the maximum key value must also be visited for deduplication. Therefore, when no row of data larger than the new search key value exists in the B-tree, the most recently fetched row is the row with the maximum key value, and the search for new rows may be terminated. In this way, missing data is avoided.

Example two

Fig. 3 is a schematic flow chart of a method for generating deduplication data according to a second embodiment of the present invention. In this embodiment, locating the first row of data larger than the search key value according to the structure of the B-tree is specifically optimized as follows: determining the leaf node in which the new search key value is located; searching the page corresponding to the leaf node for the first row of data with a key value larger than the search key value, and if such a row is found, locating it in the page of the leaf node; otherwise, locating the first row of data larger than the search key value in the page corresponding to the sibling node immediately to the right of the leaf node.

Referring to fig. 3, the method for generating deduplication data includes:

S210, in the sorted B-tree, locating the first row of data meeting the starting condition in the screening condition according to the structure of the B-tree, and fetching the first row of data.

And S220, generating a new search key value according to the key value of the taken first row of data.

And S230, determining the leaf node where the new search key value is located.

For example, the search key value may be compared with the key values of the root node of the B-tree to determine the leaf node in which the new search key value is located. Still taking the B-tree shown in FIG. 2 as an example, the root node has the key values 3 and 5. When the new search key value is 3, it is compared with the key values of the root node; since it equals the root key value 3 and the B-tree is sorted, it may be determined that the leaf node in which the new search key value is located is the middle leaf node below the root node. If the new search key value is 2, since 2 is less than 3, it may be determined that the leaf node in which the new search key value is located is the left leaf node below the root node.
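Step S230 can be sketched as a descent from the root to a leaf. The `Node` class and `find_leaf` function are hypothetical names, and the leaf contents are assumptions loosely based on the description of FIG. 2 (root keys 3 and 5, three leaves); the actual figure is not reproduced here.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                    # separator keys (internal) or row keys (leaf)
    children: list = field(default_factory=list)  # empty for a leaf


def find_leaf(node, search_key):
    """Descend from `node` to the leaf in which `search_key` is located."""
    while node.children:
        # A key equal to a separator is routed to the right of it (assumption).
        node = node.children[bisect_right(node.keys, search_key)]
    return node


left, middle, right = Node([1, 1, 2, 2]), Node([3, 3, 4]), Node([5, 5, 6])
root = Node([3, 5], [left, middle, right])
print(find_leaf(root, 3) is middle)  # True
print(find_leaf(root, 2) is left)    # True
```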

S240, searching the first row of data with the key value larger than the new search key value in the page corresponding to the leaf node, and if the first row of data with the key value larger than the new search key value is searched, positioning the first row of data with the key value larger than the new search key value in the page of the leaf node.

For a B-tree structure with relatively few duplicate data, the data rows corresponding to adjacent key values are typically stored in the same page. Therefore, after the page where the new search key value is located is determined, row data which is larger than the new search key value can be searched for in the page. If so, a first row of data larger than the new lookup key value may be located in the page of the leaf node.

Still taking the B-tree in FIG. 2 as an example, if the search key value is 3 and the key value 4 is found in the page corresponding to the middle leaf node, the first row of data larger than the search key value 3 is located as the row of data corresponding to the key value 4.

Preferably, searching for the first row of data with a key value larger than the search key value in the page corresponding to the leaf node may include: locating the first row of data larger than the search key value in the page by intra-page binary search. Binary search, also called the bisection or half-interval method, works on elements stored in an array in ascending key order. The given key is first compared with the key of the element at the middle position; if they are equal, the search succeeds; otherwise, if the given key is smaller, the search continues in the first half, and if it is larger, the search continues in the second half. Each comparison thus halves the search interval, and the process continues until the search succeeds or fails. Binary search is an efficient search method, but it requires that the elements be sorted in ascending order by the search key; otherwise the search will not find the correct row. Still taking the B-tree shown in FIG. 2 as an example, if the new search key value is 3 and the key value at the middle of the page corresponding to the middle leaf node is 3, the search continues in the right half of the page; when the key value 4 is found there, the first row of data larger than the new search key value 3 is located as the row of data corresponding to the key value 4.
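A minimal sketch of the intra-page binary search is shown below. It is written as an "upper bound" search that returns the position of the first key strictly larger than the search key, which is the variant this step needs. The function name `first_greater_in_page` and the page contents are assumptions.

```python
def first_greater_in_page(page_keys, search_key):
    """Index of the first key in the sorted page larger than `search_key`, or None."""
    lo, hi = 0, len(page_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if page_keys[mid] <= search_key:
            lo = mid + 1   # the answer lies in the right half
        else:
            hi = mid       # page_keys[mid] is a candidate, keep searching left
    return lo if lo < len(page_keys) else None


page = [3, 3, 4]  # assumed contents of the middle leaf page
print(first_greater_in_page(page, 3))  # 2 (the row with key 4)
print(first_greater_in_page(page, 4))  # None
```

A return value of None corresponds to the case handled in the next step, where the search continues in the right sibling page.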

S250, otherwise, locating the first row of data larger than the new search key value in the page corresponding to the sibling node immediately to the right of the leaf node.

If the page containing the new search key value holds many duplicate rows, the page may contain no data larger than the new search key value. In that case, the first row of data larger than the search key value needs to be located in the page corresponding to the sibling node immediately to the right of the leaf node. Still taking the B-tree in FIG. 2 as an example, if the new search key value is 2, it is compared with the key values of the root node; since the root node has the key values 3 and 5 and the B-tree is sorted, it may be determined that the new search key value 2 is located in the left leaf node below the root node. The page corresponding to the left leaf node is searched for the first row of data larger than the new search key value; since the left leaf node contains no data larger than 2, the search moves to the leaf node immediately to its right, namely the middle leaf node. The page corresponding to the middle leaf node is then searched, and the first row of data larger than the search key value 2 is located there.
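The fallback to the right sibling can be sketched as follows. The `Leaf` class with a `right` link, the function name `first_greater_from_leaf`, and the leaf contents are assumptions. The loop keeps moving right while a page has no larger key, which slightly generalizes the single sibling step described above.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list                      # sorted row keys stored in this leaf page
    right: Optional["Leaf"] = None  # sibling immediately to the right, if any


def first_greater_from_leaf(leaf, search_key):
    """Return the first key larger than `search_key`, starting in `leaf`."""
    while leaf is not None:
        pos = bisect_right(leaf.keys, search_key)
        if pos < len(leaf.keys):
            return leaf.keys[pos]   # S240: found inside this page
        leaf = leaf.right           # S250: continue in the right sibling page
    return None                     # no larger key exists in the tree


right_leaf = Leaf([5, 5, 6])
middle_leaf = Leaf([3, 3, 4], right_leaf)
left_leaf = Leaf([1, 1, 2, 2], middle_leaf)
print(first_greater_from_leaf(left_leaf, 2))   # 3
print(first_greater_from_leaf(right_leaf, 6))  # None
```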

And S260, returning to execute the operation of generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and taking out the first row of data until the new search key value is larger than the key value corresponding to the termination condition in the screening condition.

And S270, generating deduplication data according to all the extracted first line data.

In this embodiment, the positioning to the first row of data larger than the lookup key value according to the structure of the B tree is specifically optimized as follows: determining the leaf node where the new search key value is located; searching a first row of data with key values larger than the new search key value in the page corresponding to the leaf node, and if the first row of data with key values larger than the new search key value is searched, positioning the first row of data with key values larger than the new search key value in the page of the leaf node; otherwise, positioning the first row of data larger than the new search key value in the page corresponding to the sibling node close to the right side of the leaf node. The storage location of the data in the sorted B-tree may be used to quickly locate the first row of data that is larger than the generated new lookup key.

Example three

Fig. 4 is a schematic flow chart of a method for generating deduplication data according to a third embodiment of the present invention. In this embodiment, locating the first row of data meeting the starting condition in the screening condition according to the structure of the B-tree is specifically optimized as follows: when the starting condition is absent (defaulted), locating the first row of data of the leftmost leaf node of the B-tree.

Referring to fig. 4, the method for generating deduplication data includes:

S310, in the sorted B-tree, when the starting condition is absent, locating the first row of data of the leftmost leaf node of the B-tree, and fetching the first row of data.

In some cases, no range is set for deduplication in the B-tree. This situation is referred to as a defaulted starting condition, and it means that all rows of data in the B-tree need to be deduplicated. In the B-tree, if the left subtree of a node is not empty, the values of all nodes in that left subtree are smaller than the value of the node. It can therefore be determined that the smallest row of data is located in the leftmost leaf node. In addition, within the page corresponding to a leaf node the rows are arranged from smallest to largest, so the first row of the leftmost leaf node of the B-tree is the smallest data. Thus, the first row of data of the leftmost leaf node of the B-tree is located and fetched.
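Locating the first row when the starting condition is absent can be sketched as a descent along the leftmost children. The `Node` class, the function name `first_row_default_start`, and the leaf contents are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list                                    # separator keys (internal) or row keys (leaf)
    children: list = field(default_factory=list)  # empty for a leaf


def first_row_default_start(root):
    """Return the smallest key: the first row of the leftmost leaf, or None if empty."""
    node = root
    while node.children:
        node = node.children[0]   # always follow the leftmost child
    return node.keys[0] if node.keys else None


root = Node([3, 5], [Node([1, 1, 2, 2]), Node([3, 3, 4]), Node([5, 5, 6])])
print(first_row_default_start(root))  # 1
```

The descent visits one node per level of the tree, so no leaf other than the leftmost one is read.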

S320, generating a new search key value according to the key value of the extracted first row of data, positioning the first row of data larger than the search key value according to the structure of the B-tree, and extracting the first row of data.

And S330, returning to execute the operation of generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and taking out the first row of data until the new search key value is larger than the key value corresponding to the termination condition in the screening condition.

And S340, generating deduplication data according to all the extracted first line data.

In this embodiment, locating the first row of data that satisfies the starting condition in the screening condition according to the structure of the B-tree is specifically optimized as follows: when the starting condition is absent (defaulted), locating the first row of data of the leftmost leaf node of the B-tree. When the starting condition is absent, the position of the smallest row of data can thus be located accurately.

Example four

Fig. 5 is a structural diagram of an apparatus for generating deduplication data according to a fourth embodiment of the present invention, and as shown in fig. 5, the apparatus includes:

a starting positioning module 410, configured to position, in the sorted B-tree, a first row of data that meets a starting condition in a screening condition according to a structure of the B-tree, and take out the first row of data;

the middle positioning module 420 is configured to generate a new search key value according to the key value of the retrieved first row of data, position the first row of data larger than the search key value according to the structure of the B tree, and retrieve the first row of data;

the circular positioning module 430 is configured to return to execute operations of generating a new search key value according to the key value of the extracted first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and extracting the first row of data until the new search key value is larger than the key value corresponding to the termination condition in the screening condition;

and a generating module 440, configured to generate deduplication data according to all the fetched first row data.

The device for generating deduplication data according to this embodiment locates the first row of data that satisfies the initial condition by using the structural features of the sorted B-tree. The search key value is then iteratively replaced according to each first row of data obtained by locating, until the generated new search key value is larger than the key value corresponding to the termination condition in the screening condition. The deduplication data is generated from the fetched first rows of data. Non-duplicate data can thus be located using the structural features of the sorted B-tree without traversing all the rows of data in the B-tree, which reduces the amount of data processed and, in turn, the deduplication processing time.

On the basis of the above embodiments, the intermediate positioning module includes:

a leaf node determining unit, configured to determine the leaf node where the new search key value is located;

a positioning unit, configured to search the page corresponding to the leaf node for the first row of data whose key value is larger than the new search key value and, if such a row is found, to locate that row in the page of the leaf node;

otherwise, to locate the first row of data larger than the new search key value in the page corresponding to the sibling node immediately to the right of the leaf node.
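The leaf-level behaviour described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the LeafPage class, its keys and right fields, and the function first_greater are hypothetical names, and the leaf passed in is assumed to have been found already by descending the B-tree. The sketch keeps moving to further right siblings when a sibling page holds no larger key, which goes slightly beyond the text, where only the adjacent right sibling is mentioned.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LeafPage:
    keys: List[int]                     # sorted keys of the rows in this page
    right: Optional["LeafPage"] = None  # sibling immediately to the right


def first_greater(leaf: LeafPage, search_key: int) -> Optional[Tuple[LeafPage, int]]:
    """Return (page, slot) of the first row whose key exceeds search_key.

    Returns None when no larger key exists to the right of the given leaf.
    """
    page = leaf
    while page is not None:
        # Look inside the current page first.
        slot = bisect_right(page.keys, search_key)
        if slot < len(page.keys):
            return page, slot
        # Nothing larger in this page: continue in the right sibling.
        page = page.right
    return None


if __name__ == "__main__":
    a, b = LeafPage([1, 1, 2]), LeafPage([2, 2, 3])
    a.right = b
    page, slot = first_greater(a, 2)
    print(page.keys[slot])
```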

On the basis of the above embodiments, the positioning unit includes:

a positioning subunit, configured to locate the first row of data larger than the new search key value in the page by intra-page binary search;

and a positioning unit, configured to generate a search key value according to the starting condition and to locate the first row of data larger than the search key value according to the structure of the B-tree.
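Both units above come down to finding, within a sorted page, the first slot whose key is strictly larger than a given search key value. A hand-written binary search for that slot might look like the following sketch; the function name upper_bound and the representation of a page as a plain sorted list of keys are assumptions made for illustration.

```python
def upper_bound(page_keys, search_key):
    """Index of the first key in page_keys strictly greater than search_key.

    page_keys must be sorted in ascending order. Returns len(page_keys)
    when the page holds no key larger than search_key.
    """
    lo, hi = 0, len(page_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if page_keys[mid] <= search_key:
            lo = mid + 1  # mid and everything left of it is not larger
        else:
            hi = mid      # mid is a candidate; keep looking to its left
    return lo


if __name__ == "__main__":
    print(upper_bound([1, 1, 2, 2, 3], 1))
```

A return value equal to the page length signals that the search has to continue in the next page to the right.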

On the basis of the foregoing embodiments, the start positioning module includes:

a default start positioning unit, configured to locate the first row of data of the leftmost leaf node of the B-tree when the starting condition is default (not specified).

On the basis of the above embodiments, the cyclic positioning module includes:

a default termination unit, configured to, when the termination condition is default (not specified), continue until no row of data larger than the new search key value exists in the B-tree.

The device for generating the deduplication data provided by the embodiment of the invention can execute the method for generating the deduplication data provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented by the apparatus described above. Alternatively, the embodiments of the present invention may be implemented as programs executable by a computer device, so that they can be stored in a storage device and executed by a processor; the programs may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk. The modules or steps may also each be fabricated as an individual integrated circuit module, or several of them may be fabricated together as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

It is to be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the spirit of the present invention; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method for generating deduplication data, comprising:

in the sorted B tree, positioning a first row of data meeting the initial condition in the screening condition according to the structure of the B tree, and taking out the first row of data larger than the search key value;

generating a new search key value according to the key value of the extracted first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and extracting the first row of data larger than the new search key value;

returning to execute the operation of generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the new search key value according to the structure of the B tree, and taking out the first row of data larger than the new search key value until the new search key value is larger than the key value corresponding to the termination condition in the screening condition;

generating deduplication data according to all the taken first line data;

wherein the locating the first row of data larger than the new search key value according to the structure of the B-tree comprises:

determining the leaf node where the new search key value is located;

searching a first row of data with key values larger than the new search key value in the page corresponding to the leaf node, and if the first row of data with key values larger than the new search key value is searched, positioning the first row of data with key values larger than the new search key value in the page of the leaf node;

2. The method of claim 1, wherein the searching for the first row of data with a key value larger than the new key value in the page corresponding to the leaf node comprises:

and positioning the first row of data larger than the new search key value in the page by intra-page binary search.

3. The method of claim 1, wherein said locating the first row of data satisfying a starting condition in a filtering condition according to the structure of the B-tree comprises:

and when the starting condition is default, positioning the first row of data of the leftmost leaf node of the B-tree.

4. The method of claim 1, wherein continuing until the new search key value is larger than the key value corresponding to the termination condition in the screening condition comprises:

when the termination condition is default, continuing until no row of data larger than the new search key value exists in the B-tree.

5. The method of claim 1, wherein said locating the first row of data satisfying a starting condition in a filtering condition according to the structure of the B-tree comprises:

and generating a search key value according to the starting condition, and positioning the first row of data larger than the search key value according to the structure of the B tree.

6. An apparatus for generating deduplication data, comprising:

a start positioning module, configured to position, in the sorted B-tree and according to the structure of the B-tree, a first row of data meeting the starting condition in the screening condition, and to take out the first row of data larger than the search key value;

an intermediate positioning module, configured to generate a new search key value according to the key value of the taken first row of data, to position the first row of data larger than the new search key value according to the structure of the B-tree, and to take out the first row of data;

a cyclic positioning module, configured to return to and execute the operations of generating a new search key value according to the key value of the taken first row of data, positioning the first row of data larger than the new search key value according to the structure of the B-tree, and taking out the first row of data larger than the new search key value, until the new search key value is larger than the key value corresponding to the termination condition in the screening condition;

a generating module, configured to generate deduplication data according to all the taken first row data;

wherein the intermediate positioning module comprises:

7. The apparatus of claim 6, wherein the positioning unit comprises:

8. The apparatus of claim 6, wherein the intermediate positioning module comprises: