CN107944038A - Method and device for generating deduplicated data - Google Patents

Method and device for generating deduplicated data

Info

Publication number
CN107944038A
Authority
CN
China
Prior art keywords
row data
key value
data
tree
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711336936.1A
Other languages
Chinese (zh)
Other versions
CN107944038B (en)
Inventor
张钦
张黎敏
朱仲颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dameng Database Co Ltd
Original Assignee
Shanghai Dameng Database Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dameng Database Co Ltd
Priority to CN201711336936.1A
Publication of CN107944038A
Application granted
Publication of CN107944038B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for generating deduplicated data. The method includes: in a sorted B-tree, locating, according to the structure of the B-tree, the first row that meets a start condition, and taking out that row; generating a search key from the row taken out, locating, according to the structure of the B-tree, the first row whose key is greater than the search key, and taking out that row; returning to the step of generating a search key from the row taken out, until the B-tree contains no data greater than the search key; and generating the deduplicated data from the rows taken out. The structural features of the sorted B-tree are used to locate non-duplicate data without traversing every data row in the B-tree, which reduces the amount of data processed and therefore the deduplication time.

Description

Method and device for generating deduplicated data
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and device for generating deduplicated data.
Background
Data deduplication is a data compression technique for eliminating redundant data. In a typical deduplication process, first data is compared with stored data to detect copies, that is, to determine whether the first data is unique. When the first data is identified as not unique, the redundant first data is eliminated and replaced with a small reference pointing to the stored data.
At present, data deduplication can use the following algorithm. Step 1: obtain the first row of data and place it in a temporary buffer. Step 2: when a new row arrives, compare it one by one with the data in the buffer; if identical data is found, discard the new row; otherwise, put the new row into the buffer. Step 3: try to obtain new data; if there is any, repeat step 2; otherwise, the data in the buffer is the deduplicated data.
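The buffer-based algorithm can be sketched as follows. This is a minimal illustration only; the function name `dedup_with_buffer` and the use of a Python list as the temporary buffer are assumptions, not part of the patent text.

```python
def dedup_with_buffer(rows):
    """Naive deduplication: compare every new row with the rows buffered so far."""
    buffer = []                      # temporary buffer of rows already seen
    for row in rows:                 # every pending row is visited
        if row not in buffer:        # one-by-one comparison against the buffer
            buffer.append(row)       # unseen row: keep it
        # otherwise the new row is a duplicate and is discarded
    return buffer


print(dedup_with_buffer([3, 1, 3, 2, 1]))  # [3, 1, 2]
```

The buffer grows with the number of distinct rows, and every input row is still visited, which are the two costs discussed next.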
This algorithm has the following problems: if the amount of non-duplicate data is large, a large temporary space is needed to store the set. In addition, all pending data must be traversed, so the method is very time-consuming when there is a lot of pending data.
Another algorithm sorts the data in advance and then deduplicates the sorted data. Step 1: the first row can be output directly, and this row is kept as the retained row. Step 2: obtain a new row; if the new row is identical to the retained row, discard it; if it differs, output the new row and set it as the retained row. Step 3: try to obtain a new row; if there is one, repeat step 2; otherwise the deduplication ends. The advantage of this algorithm is that it does not need a large temporary space, but it still has to traverse all pending data and is therefore equally time-consuming.
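The sort-then-compare algorithm can be sketched in the same way. The name `dedup_sorted` is hypothetical, and the sketch collects the output rows in a list only so that the result can be inspected; a streaming implementation would need just the single retained row.

```python
def dedup_sorted(rows):
    """Deduplicate by sorting first, then comparing each row with the retained row."""
    output = []
    retained = None
    for i, row in enumerate(sorted(rows)):   # every row is still traversed
        if i == 0 or row != retained:
            output.append(row)               # output the new row
            retained = row                   # and make it the retained row
        # an identical row is discarded
    return output


print(dedup_sorted([3, 1, 3, 2, 1]))  # [1, 2, 3]
```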
Summary of the invention
Embodiments of the present invention provide a method and device for generating deduplicated data, to solve the technical problem in the prior art that deduplicating data requires traversing all pending data.
In a first aspect, an embodiment of the present invention provides a method for generating deduplicated data, including:
in a sorted B-tree, locating, according to the structure of the B-tree, the first row that meets the start condition, and taking out that first row;
generating a search key from the first row taken out, locating, according to the structure of the B-tree, the first row whose key is greater than the search key, and taking out that first row;
returning to the step of generating a search key from the first row taken out, until the B-tree contains no data greater than the search key;
generating the deduplicated data from the first rows taken out.
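The four steps of the first aspect can be sketched as a loop that jumps from one distinct key to the next. As a simplifying assumption, a sorted Python list stands in for the sorted B-tree and the standard-library `bisect_right` stands in for "locate according to the structure of the B-tree"; the function name `distinct_keys` and the parameter `greater_than` are hypothetical.

```python
from bisect import bisect_right


def distinct_keys(sorted_keys, greater_than=None):
    """Collect each distinct key once by repeatedly jumping past the current key."""
    taken = []
    # Step 1: locate the first row meeting the start condition
    # (the very first row when no start condition is given).
    pos = 0 if greater_than is None else bisect_right(sorted_keys, greater_than)
    while pos < len(sorted_keys):
        search_key = sorted_keys[pos]        # the row taken out yields the search key
        taken.append(search_key)
        # Steps 2 and 3: locate the first row greater than the search key.
        pos = bisect_right(sorted_keys, search_key, lo=pos)
    # Step 4: the rows taken out form the deduplicated data.
    return taken


print(distinct_keys([1, 1, 1, 2, 2, 3, 5, 5]))          # [1, 2, 3, 5]
print(distinct_keys([1, 1, 2, 3], greater_than=1))      # [2, 3]
```

Each iteration costs one logarithmic search instead of one step per duplicate row, so the benefit grows with the number of duplicates per key.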
In a second aspect, an embodiment of the present invention further provides a device for generating deduplicated data, including:
a start locating module, configured to, in a sorted B-tree, locate, according to the structure of the B-tree, the first row that meets the start condition in the filter condition, and take out that first row;
an intermediate locating module, configured to generate a new search key from the key of the first row taken out, locate, according to the structure of the B-tree, the first row greater than the search key, and take out that first row;
a loop locating module, configured to return to the operations of generating a new search key from the key of the first row taken out, locating, according to the structure of the B-tree, the first row greater than the new search key, and taking out that first row, until the new search key is greater than the key corresponding to the end condition in the filter condition;
a generation module, configured to generate the deduplicated data from all the first rows taken out.
The method and device for generating deduplicated data provided by embodiments of the present invention use the structural features of a sorted B-tree to locate the first row that meets the start condition, iteratively replace the search key according to each first row located, until the newly generated search key is greater than the key corresponding to the end condition in the filter condition, and generate the deduplicated data from the multiple first rows taken out. The structural features of the sorted B-tree are used to locate non-duplicate data without traversing every data row in the B-tree, which reduces the amount of data processed and therefore the deduplication time.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a flowchart of the method for generating deduplicated data provided by Embodiment 1 of the present invention;
Fig. 2 is a structure diagram of a B-tree used in the method for generating deduplicated data provided by Embodiment 1 of the present invention;
Fig. 3 is a flowchart of the method for generating deduplicated data provided by Embodiment 2 of the present invention;
Fig. 4 is a flowchart of the method for generating deduplicated data provided by Embodiment 3 of the present invention;
Fig. 5 is a structure diagram of the device for generating deduplicated data provided by Embodiment 4 of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment 1
Fig. 1 is a flowchart of the method for generating deduplicated data provided by Embodiment 1 of the present invention. The method of this embodiment applies to deduplicating data in a sorted B-tree. It can be performed by a device for generating deduplicated data, and the device can be implemented in hardware and/or software.
Referring to Fig. 1, the method for generating deduplicated data includes:
S110: in a sorted B-tree, locate, according to the structure of the B-tree, the first row that meets the start condition in the filter condition, and take out that first row.
A B-tree is a tree data structure that stores data in sorted order and supports lookup, sequential reading, insertion and deletion in O(log n) time. In short, a B-tree is a search tree in which a node can have more than two child nodes. It can be regarded as a generalization of the 2-3 search tree: in a B-tree of order M, each node holds at most M-1 keys and has at most M child nodes. The root node has at least two child nodes (unless it is a leaf), and every other internal node has at least M/2 child nodes (rounded up). Insertion and deletion keep the B-tree sorted. In the sorted B-tree, the keys within each node are usually arranged in ascending order, and the keys in the child subtree located between two adjacent keys of a node lie between those two keys.
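To make the ordering property concrete, here is a minimal sketch of a lookup in a B-tree. It assumes the classic layout in which internal nodes also store keys; the class `BTreeNode`, the function `search` and the sample tree are hypothetical, and Fig. 2 is not reproduced here.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    keys: list                                     # sorted, at most M-1 keys
    children: list = field(default_factory=list)   # empty for a leaf, else len(keys) + 1


def search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    while True:
        i = bisect_left(node.keys, key)            # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:                      # reached a leaf without a match
            return False
        node = node.children[i]                    # descend between keys[i-1] and keys[i]


root = BTreeNode([3, 5], [BTreeNode([1, 2]), BTreeNode([4]), BTreeNode([6, 7])])
print(search(root, 4), search(root, 8))  # True False
```

Only one node per level is visited, which is where the O(log n) lookup cost comes from.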
Because the deduplicated data should include every non-duplicate value, the search must start from one side of the B-tree. It usually starts from the smallest data row, so the row with the smallest key must be located.
Locating, according to the structure of the B-tree, the first row that meets the start condition in the filter condition can include: generating a search key from the start condition, and locating, according to the structure of the B-tree, the first row whose key is greater than that search key. In a sorted B-tree the keys of all data rows are arranged in order: rows whose keys are smaller than a key of the root node are in the leaf nodes to its left, and rows whose keys are greater are in the leaf nodes to its right. The structure of the B-tree can therefore be used to navigate to the corresponding leaf node without traversing all leaf nodes.
The start condition can be part of a filter condition. For example, it can be the lower limit of a configured search key range. Fig. 2 is a structure diagram of a B-tree used in the method provided by Embodiment 1 of the present invention. For the B-tree shown in Fig. 2, the filter condition could be "greater than 3", "greater than 2", or some other filter condition.
Locating, according to the structure of the B-tree, the first row greater than the search key can include: determining, from the search key, the leaf node that holds the first row greater than the search key; and locating that first row in the page corresponding to the leaf node. Only the first row is chosen because several rows may have keys greater than the search key. These rows may have the same key or different keys. Rows with a different key are outside the scope of the current lookup, and rows with the same key are identical to the first row, so only the first row needs to be taken.
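A small sketch shows why taking only the first row is enough. It assumes, for illustration, that a page is a list of `(key, payload)` rows sorted by key; the function name `first_row_greater` and the sample page are hypothetical.

```python
from bisect import bisect_right


def first_row_greater(page, search_key):
    """Return the first row in the page whose key exceeds search_key, or None."""
    keys = [key for key, _ in page]          # keys of the rows, already in sorted order
    i = bisect_right(keys, search_key)       # index just past all rows with key <= search_key
    return page[i] if i < len(page) else None


page = [(3, "a"), (3, "b"), (4, "c"), (4, "d")]
print(first_row_greater(page, 3))  # (4, 'c')
print(first_row_greater(page, 4))  # None
```

The second row with key 4 is never touched: it repeats the key of the row already taken, and the next lookup jumps past it.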
After the first row that meets the start condition has been located, it is taken out so that the deduplicated data can be generated later.
S120: generate a search key from the key of the first row taken out, locate, according to the structure of the B-tree, the first row greater than the search key, and take out that first row.
Taking out the first row only deduplicates the rows for the first key that meets the preset condition; the rows for the other keys still need to be deduplicated. In this embodiment, a new search key is therefore generated from the key of the first row taken out. For example, the key of the first row taken out can be used directly as the new search key: if the key of the row taken out is 2, the new search key is 2; if it is 3, the new search key is 3.
Once the new search key is determined, the first row greater than the new search key is again located according to the structure of the B-tree and taken out. This can be done in the same way as the locating method described above: compare the new search key with the keys of the root node to determine the leaf node where the new search key resides, and search the page of that leaf node for data greater than the search key. If such data exists, locate the first row greater than the search key; if not, locate the first row greater than the search key in the page corresponding to the sibling node to the right of that leaf node.
For a B-tree with few duplicates, the search for data greater than the search key can start directly in the page that holds the key of the first row just taken out. If such data exists there, the first row greater than the search key is located; if not, it is located in the page corresponding to the sibling node to the right of that leaf node. After it is located, the first row greater than the search key is taken out.
If a data row corresponds to multiple key columns (a composite key), the key of the first row found is likewise used as the search key, and that first row is taken out. For example, suppose the rows whose original key in Fig. 2 is 1 have the composite keys (1,1), (1,1), (1,1), (1,1), (1,1), (1,2), (1,2), (1,2), (1,2), (1,2). The first row (1,1) is taken out and used as the new search key; searching for rows greater than (1,1) finds the row with key (1,2), and (1,2) in turn becomes the new search key.
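The composite-key example can be reproduced with a short sketch. It assumes that composite keys compare column by column, which Python tuples do natively, and it again uses a sorted list in place of the B-tree; the function name `distinct_composite` is hypothetical.

```python
from bisect import bisect_right


def distinct_composite(sorted_rows):
    """Take one row per distinct composite key from rows sorted lexicographically."""
    taken = []
    pos = 0
    while pos < len(sorted_rows):
        search_key = sorted_rows[pos]                  # e.g. (1, 1)
        taken.append(search_key)
        pos = bisect_right(sorted_rows, search_key)    # first row greater than the search key
    return taken


rows = [(1, 1)] * 5 + [(1, 2)] * 5
print(distinct_composite(rows))  # [(1, 1), (1, 2)]
```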
S130: generate a new search key from the key of the most recently taken-out first row, and return to the operations of locating, according to the structure of the B-tree, the first row greater than the new search key and taking out that first row, until the new search key is greater than the key corresponding to the end condition in the filter condition.
For example, the key of the most recently taken-out first row can be used as the new search key; the first row greater than the new search key is then located according to the structure of the B-tree and taken out. A new search key is again generated from the row just taken out, and the locating and taking-out operations are repeated until the newly generated search key is greater than the key corresponding to the end condition in the filter condition. That key can be the upper-limit key of a fixed range given in the filter condition. For example, if the filter condition is "greater than 2 and less than 4", the key of the end condition is 3; when the repeated computation yields a search key of 4, the operations of locating and taking out the first row greater than the new search key terminate.
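The loop with both a start and an end condition can be sketched as below. The sketch assumes an open range `lower < key < upper` like the "greater than 2 and less than 4" example, and uses a sorted list in place of the B-tree; `distinct_in_range` and its parameters are hypothetical names.

```python
from bisect import bisect_right


def distinct_in_range(sorted_keys, lower, upper):
    """Distinct keys k with lower < k < upper, found by repeated 'next greater' lookups."""
    taken = []
    search_key = lower                        # the start condition supplies the first search key
    while True:
        pos = bisect_right(sorted_keys, search_key)
        if pos == len(sorted_keys):           # nothing greater than the search key remains
            break
        search_key = sorted_keys[pos]         # key of the row just located
        if search_key >= upper:               # beyond the end condition: stop
            break
        taken.append(search_key)
    return taken


print(distinct_in_range([1, 2, 2, 3, 3, 4, 5], lower=2, upper=4))  # [3]
```

With the sample keys, the lookup after 3 lands on 4, which lies past the end condition, so the loop stops without visiting key 5.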
S140: generate the deduplicated data from all the first rows taken out.
No first row taken out by the above operations is identical to any other row taken out, which is equivalent to having deduplicated the identical rows. The deduplicated data can therefore be generated from all the first rows taken out. For example, all the first rows taken out can be combined to form the deduplicated data of the B-tree.
This embodiment uses the structural features of a sorted B-tree to locate the first row that meets the start condition, iteratively replaces the search key according to each first row located, until the newly generated search key is greater than the key corresponding to the end condition in the filter condition, and generates the deduplicated data from the multiple first rows taken out. The structural features of the sorted B-tree are used to locate non-duplicate data without traversing every data row in the B-tree, which reduces the amount of data processed and therefore the deduplication time.
In a preferred implementation of this embodiment, the condition "until the new search key is greater than the key corresponding to the end condition in the filter condition" can be refined as follows: when the end condition is omitted, continue until the B-tree contains no row greater than the new search key. If there is no end condition, deduplication must reach the row with the largest key. When the B-tree contains no row greater than the new search key, the last row taken out is the row with the largest key, so the search for new rows can end. This avoids missing data.
Embodiment 2
Fig. 3 is a flowchart of the method for generating deduplicated data provided by Embodiment 2 of the present invention. This embodiment is an optimization based on the above embodiment. In this embodiment, locating, according to the structure of the B-tree, the first row greater than the search key is refined as: determining the leaf node where the new search key resides; searching the page corresponding to that leaf node for the first row greater than the search key; if it is found, locating it in the page of that leaf node; otherwise, locating the first row greater than the search key in the page corresponding to the sibling node to the right of that leaf node.
Referring to Fig. 3, the method for generating deduplicated data includes:
S210: in a sorted B-tree, locate, according to the structure of the B-tree, the first row that meets the start condition in the filter condition, and take out that first row.
S220: generate a new search key from the key of the first row taken out.
S230: determine the leaf node where the new search key resides.
For example, the new search key can be compared with the keys of the root node of the B-tree to determine the leaf node where the new search key resides. Taking the B-tree shown in Fig. 2 as an example again, the leaf node is determined starting from the root node of the B-tree. When the new search key is 3, it is compared with the keys of the root node. The root node holds the keys 3 and 5, the search key equals the root key 3, and the B-tree is sorted, so the leaf node where the new search key resides is the middle leaf node under the root. If the new search key is 2, then since the root node holds 3 and 5 and 2 is less than 3, the leaf node where the new search key resides is the left leaf node under the root.
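The descent from the root to the leaf can be sketched as follows. Fig. 2 is not reproduced here, so the leaf contents are assumptions chosen to match the description (root keys 3 and 5, key 3 routed to the middle leaf, key 2 to the left leaf); `Node` and `find_leaf` are hypothetical names.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)   # empty for a leaf


def find_leaf(node, search_key):
    """Descend from the root to the leaf whose key range contains search_key."""
    while node.children:
        # A key equal to a separator is routed to the child on the separator's right.
        node = node.children[bisect_right(node.keys, search_key)]
    return node


left, middle, right = Node([1, 2]), Node([3, 4]), Node([5, 6])
root = Node([3, 5], [left, middle, right])
print(find_leaf(root, 3) is middle, find_leaf(root, 2) is left)  # True True
```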
S240: search the page corresponding to the leaf node for the first row greater than the new search key; if it is found, locate it in the page of that leaf node.
In a B-tree with relatively few duplicates, the rows for adjacent keys are usually stored in the same page. After the page where the new search key resides has been determined, that page can therefore be searched for a row greater than the new search key. If one exists, the first row greater than the new search key is located in the page of that leaf node.
Taking the B-tree in Fig. 2 as an example again, key 4 is found in the page of the middle leaf node as the key of the first row greater than search key 3, so the row corresponding to key 4 is located as the first row greater than the search key.
Preferably, searching the page corresponding to the leaf node for the first row greater than the search key can include: using binary search within the page to locate the first row greater than the search key. Binary search, also called half-interval search, works on elements stored in an array in ascending order. The given key is first compared with the key of the element in the middle position; if they are equal, the search succeeds. Otherwise, if the given key is smaller, the search continues in the first half; if it is larger, the search continues in the second half. Each comparison halves the search range, and the process continues until the search succeeds or fails. Binary search is an efficient lookup method, but it requires the elements to be sorted by key; if they are not, the search will not find the correct row. Taking the B-tree shown in Fig. 2 as an example again, if the new search key is 3, the key in the middle of the page of the middle leaf node is 3, so the search continues in the right half of the page; when key 4 is found, the row corresponding to key 4 is located as the first row greater than search key 3.
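Because the goal is the first key greater than the search key rather than an exact match, the in-page search is the "upper bound" variant of binary search. A hand-written sketch follows; the function name `upper_bound` and the sample page contents are assumptions.

```python
def upper_bound(keys, search_key):
    """Index of the first element greater than search_key in a sorted list (len(keys) if none)."""
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if keys[mid] <= search_key:
            lo = mid + 1          # middle key is not greater: continue in the right half
        else:
            hi = mid              # middle key is greater: it may be the answer, keep it in range
    return lo


page_keys = [3, 3, 4, 4]
print(upper_bound(page_keys, 3))  # 2
print(upper_bound([1, 2], 2))     # 2
```

A result equal to the page length, as in the second call, means the page holds no greater key and the lookup has to continue in the right sibling page.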
S250: otherwise, locate the first row greater than the new search key in the page corresponding to the sibling node to the right of the leaf node.
If the page where the new search key resides holds many duplicates, it may contain no data greater than the new search key. In that case, the first row greater than the search key must be located in the page corresponding to the sibling node to the right of that leaf node. Taking the B-tree in Fig. 2 as an example again, if the new search key is 2, it is compared with the keys of the root node; because the root holds 3 and 5 and the B-tree is sorted, the leaf node where search key 2 resides is the left leaf node under the root. The page of the left leaf node is searched for the first row greater than the new search key, but the left leaf node contains no data greater than 2, so the search moves to the adjacent leaf node on its right, that is, the middle leaf node. The page of the middle leaf node is searched, and the first row greater than search key 2 is located there.
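The fallback to the right sibling can be sketched with leaves that carry a link to their right neighbour. The `Leaf` class, its `right` field, the function `next_greater` and the leaf contents are assumptions for illustration; the patent text only states that the right sibling's page is searched.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list                          # sorted keys stored in this leaf's page
    right: Optional["Leaf"] = None      # right sibling leaf, if any


def next_greater(leaf, search_key):
    """First key greater than search_key, starting in leaf and moving right if needed."""
    while leaf is not None:
        i = bisect_right(leaf.keys, search_key)
        if i < len(leaf.keys):
            return leaf.keys[i]         # found in this page
        leaf = leaf.right               # nothing greater here: try the right sibling
    return None                         # no greater key anywhere to the right


right_leaf = Leaf([5, 6])
middle_leaf = Leaf([3, 4], right_leaf)
left_leaf = Leaf([1, 2], middle_leaf)
print(next_greater(left_leaf, 2))   # 3
print(next_greater(right_leaf, 6))  # None
```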
S260: return to the operations of generating a new search key from the key of the first row taken out, locating, according to the structure of the B-tree, the first row greater than the new search key, and taking out that first row, until the new search key is greater than the key corresponding to the end condition in the filter condition.
S270: generate the deduplicated data from all the first rows taken out.
In this embodiment, locating, according to the structure of the B-tree, the first row greater than the search key is refined as: determining the leaf node where the new search key resides; searching the page corresponding to that leaf node for the first row greater than the new search key; if it is found, locating it in the page of that leaf node; otherwise, locating the first row greater than the new search key in the page corresponding to the sibling node to the right of that leaf node. The storage locations of data in the sorted B-tree can thus be used to quickly locate the first row greater than the newly generated search key.
Embodiment 3
Fig. 4 is a flowchart of the method for generating deduplicated data provided by Embodiment 3 of the present invention. This embodiment is an optimization based on the above embodiments. In this embodiment, locating, according to the structure of the B-tree, the first row that meets the start condition in the filter condition is refined as: when the start condition is omitted, locating the first row of the leftmost leaf node of the B-tree.
Referring to Fig. 4, the method for generating deduplicated data includes:
S310: in a sorted B-tree, when the start condition is omitted, locate the first row of the leftmost leaf node of the B-tree and take out that first row.
In some cases no deduplication range is set for the B-tree. This is usually called an omitted start condition. When the start condition is omitted, all data rows in the B-tree need to be deduplicated. In a B-tree, if the left subtree of a node is not empty, all values in that left subtree are smaller than the value of the node. It follows that the smallest row is in the leftmost leaf node. Moreover, within the page corresponding to a leaf node the rows are also arranged in ascending order, so the first row of the leftmost leaf node of the B-tree is the smallest data. Therefore, the first row of the leftmost leaf node is located and taken out.
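Locating the first row of the leftmost leaf amounts to following the first child pointer at every level. A minimal sketch, with a hypothetical `Node` class and sample tree:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)   # empty for a leaf


def leftmost_first_key(root):
    """Key of the first row in the leftmost leaf, i.e. the smallest key in the tree."""
    node = root
    while node.children:
        node = node.children[0]      # always follow the leftmost child
    return node.keys[0]              # rows inside a page are in ascending order


root = Node([3, 5], [Node([1, 2]), Node([3, 4]), Node([5, 6])])
print(leftmost_first_key(root))  # 1
```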
S320: Generate a new search key value from the key value of the first row of data taken out, locate, according to the structure of the B-tree, the first row of data greater than the search key value, and take out that first row of data.
S330: Return to the operation of generating a new search key value from the key value of the first row of data taken out, locating, according to the structure of the B-tree, the first row of data greater than the new search key value, and taking out that first row of data, until the new search key value is greater than the key value corresponding to the end condition in the filter conditions.
S340: Generate the deduplicated data from all the first rows of data taken out.
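Steps S310 to S340 can be sketched as the loop below. This is only an illustration under stated assumptions: a sorted Python list stands in for the sorted B-tree, `bisect` stands in for positioning by tree structure, and the function name `distinct_keys` and its parameters `start` and `end` are hypothetical. The sketch also assumes that the new search key value is simply the key value of the row just taken out, that the start condition is an inclusive lower bound, and that the end condition is checked against each located row before it is taken out.

```python
from bisect import bisect_left, bisect_right


def distinct_keys(sorted_keys, start=None, end=None):
    """Collect one row per distinct key by jumping past duplicates.

    start=None models a defaulted start condition (begin at the smallest row);
    end=None models a defaulted end condition (run until no larger row exists).
    """
    # S310: locate the first row that satisfies the start condition.
    pos = 0 if start is None else bisect_left(sorted_keys, start)
    taken = []
    while pos < len(sorted_keys):
        row = sorted_keys[pos]
        if end is not None and row > end:
            break  # the located row lies beyond the end condition
        taken.append(row)  # take out the first row of data
        # S320/S330: the key of the row taken out becomes the new search key;
        # locate the first row strictly greater than it, skipping duplicates.
        search_key = row
        pos = bisect_right(sorted_keys, search_key, pos)
    # S340: the rows taken out form the deduplicated data.
    return taken


keys = [1, 1, 2, 2, 2, 5, 7, 7, 9]
print(distinct_keys(keys))
print(distinct_keys(keys, start=2, end=7))
```

Each iteration performs one positioning operation per distinct key instead of visiting every duplicate row, which is the saving the method relies on.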
In this embodiment, locating, according to the structure of the B-tree, the first row of data that satisfies the start condition in the filter conditions is refined as: when the start condition is defaulted, locate the first row of data in the leftmost leaf node of the B-tree. In this way, the smallest data row can be located accurately when the start condition is defaulted.
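Locating the first row of the leftmost leaf node amounts to following the first child pointer from the root until a leaf is reached. The sketch below assumes a hypothetical `Node` type in which a leaf has no children and stores its row keys in ascending order; neither the type nor the function name `leftmost_first_row` comes from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical B-tree node: separator keys (internal) or row keys (leaf)."""
    keys: list
    children: list = field(default_factory=list)  # empty for a leaf node


def leftmost_first_row(root):
    """Return the smallest row key by descending along the leftmost children."""
    node = root
    while node.children:
        node = node.children[0]  # always follow the leftmost subtree
    # Rows inside a leaf page are in ascending order, so slot 0 is the minimum.
    return node.keys[0] if node.keys else None


# Usage: a small two-level tree whose leaves hold the row keys.
tree = Node([10], [
    Node([4], [Node([1, 1, 2]), Node([4, 6])]),
    Node([15], [Node([10, 12]), Node([15, 20])]),
])
print(leftmost_first_row(tree))
```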
Embodiment Four
Fig. 5 is a structural diagram of the apparatus for generating deduplicated data provided by Embodiment Four of the present invention. As shown in Fig. 5, the apparatus includes:
A start locating module 510, configured to, in a sorted B-tree, locate, according to the structure of the B-tree, the first row of data that satisfies the start condition in the filter conditions, and take out that first row of data;
An intermediate locating module 520, configured to generate a new search key value from the key value of the first row of data taken out, locate, according to the structure of the B-tree, the first row of data greater than the search key value, and take out that first row of data;
A loop locating module 530, configured to return to the operation of generating a new search key value from the key value of the first row of data taken out, locating, according to the structure of the B-tree, the first row of data greater than the new search key value, and taking out that first row of data, until the new search key value is greater than the key value corresponding to the end condition in the filter conditions;
A generation module 540, configured to generate the deduplicated data from all the first rows of data taken out.
The apparatus for generating deduplicated data provided in this embodiment uses the structural features of a B-tree that has already been sorted to locate the first row of data that satisfies the start condition. The search key value is then iteratively replaced according to each first row of data obtained by positioning, until the newly generated search key value is greater than the key value corresponding to the end condition in the filter conditions. The deduplicated data is generated from the multiple first rows of data taken out. Non-duplicate data can thus be located by using the structural features of the sorted B-tree, without traversing all data rows in the B-tree structure. This reduces the amount of data processed and, in turn, the deduplication processing time.
On the basis of the above embodiments, the intermediate locating module includes:
A leaf node determination unit, configured to determine the leaf node in which the new search key value is located;
A locating unit, configured to search the data page corresponding to the leaf node for the first row of data whose key value is greater than the new search key value and, if such a row is found, to locate it in the data page of the leaf node;
otherwise, to locate the first row of data greater than the new search key value in the data page corresponding to the right sibling of the leaf node.
On the basis of the above embodiments, the locating unit includes:
A locating subunit, configured to use in-page binary search (dichotomy) to locate, in the data page, the first row of data greater than the new search key value.
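The in-page dichotomy can be written out explicitly as an upper-bound binary search. The sketch below is illustrative only: the function name `upper_bound` and the representation of a page as a Python list of ascending key values are assumptions, not details from the source.

```python
def upper_bound(page_keys, search_key):
    """Return the slot of the first key greater than search_key.

    Returns len(page_keys) when no key in the page is greater,
    which signals that the search must continue in the right sibling.
    """
    lo, hi = 0, len(page_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if page_keys[mid] <= search_key:
            lo = mid + 1  # everything up to mid is too small or equal
        else:
            hi = mid      # mid is a candidate; look for an earlier one
    return lo


page = [3, 3, 5, 5, 8]
print(upper_bound(page, 3))
```

Because the comparison is `<=`, rows equal to the search key value are skipped, so duplicates of the row just taken out are never revisited.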
On the basis of the above embodiments, the intermediate locating module includes:
A locating unit, configured to generate a search key value according to the start condition and to locate, according to the structure of the B-tree, the first row of data greater than the search key value.
On the basis of the above embodiments, the start locating module includes:
A defaulted-start locating unit, configured to locate the first row of data in the leftmost leaf node of the B-tree when the start condition is defaulted.
On the basis of the above embodiments, the loop locating module includes:
A defaulted-end unit, configured to, when the end condition is defaulted, continue until no row of data in the B-tree is greater than the new search key value.
The apparatus for generating deduplicated data provided by the embodiments of the present invention can perform the method for generating deduplicated data provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed.
Obviously, those skilled in the art will understand that the modules or steps of the present invention described above can be implemented by the equipment described above. Optionally, the embodiments of the present invention can be implemented as programs executable by a computer device, so that they can be stored in a storage device and executed by a processor; the programs can be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Alternatively, the modules or steps can each be fabricated as a separate integrated circuit module, or several of them can be fabricated as a single integrated circuit module. The present invention is thus not limited to any specific combination of hardware and software.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to those embodiments; without departing from the inventive concept, it may also include other equivalent embodiments, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

  1. A method for generating deduplicated data, characterized by comprising:
    in a sorted B-tree, locating, according to the structure of the B-tree, the first row of data that satisfies the start condition in the filter conditions, and taking out that first row of data;
    generating a new search key value from the key value of the first row of data taken out, locating, according to the structure of the B-tree, the first row of data greater than the search key value, and taking out that first row of data;
    returning to the operation of generating a new search key value from the key value of the first row of data taken out, locating, according to the structure of the B-tree, the first row of data greater than the new search key value, and taking out that first row of data, until the new search key value is greater than the key value corresponding to the end condition in the filter conditions;
    generating the deduplicated data from all the first rows of data taken out.
  2. The method according to claim 1, characterized in that locating, according to the structure of the B-tree, the first row of data greater than the search key value comprises:
    determining the leaf node in which the new search key value is located;
    searching the data page corresponding to the leaf node for the first row of data whose key value is greater than the new search key value and, if such a row is found, locating it in the data page of the leaf node;
    otherwise, locating the first row of data greater than the new search key value in the data page corresponding to the right sibling of the leaf node.
  3. The method according to claim 2, characterized in that searching the data page corresponding to the leaf node for the first row of data whose key value is greater than the new search key value comprises:
    using in-page binary search (dichotomy) to locate, in the data page, the first row of data greater than the new search key value.
  4. The method according to claim 1, characterized in that locating, according to the structure of the B-tree, the first row of data that satisfies the start condition in the filter conditions comprises:
    when the start condition is defaulted, locating the first row of data in the leftmost leaf node of the B-tree.
  5. The method according to claim 1, characterized in that continuing until the new search key value is greater than the key value corresponding to the end condition in the filter conditions comprises:
    when the end condition is defaulted, continuing until no row of data in the B-tree is greater than the new search key value.
  6. The method according to claim 1, characterized in that locating, according to the structure of the B-tree, the first row of data that satisfies the start condition in the filter conditions comprises:
    generating a search key value according to the start condition, and locating, according to the structure of the B-tree, the first row of data greater than the search key value.
  7. An apparatus for generating deduplicated data, characterized by comprising:
    a start locating module, configured to, in a sorted B-tree, locate, according to the structure of the B-tree, the first row of data that satisfies the start condition in the filter conditions, and take out that first row of data;
    an intermediate locating module, configured to generate a new search key value from the key value of the first row of data taken out, locate, according to the structure of the B-tree, the first row of data greater than the search key value, and take out that first row of data;
    a loop locating module, configured to return to the operation of generating a new search key value from the key value of the first row of data taken out, locating, according to the structure of the B-tree, the first row of data greater than the new search key value, and taking out that first row of data, until the new search key value is greater than the key value corresponding to the end condition in the filter conditions;
    a generation module, configured to generate the deduplicated data from all the first rows of data taken out.
  8. The apparatus according to claim 7, characterized in that the intermediate locating module comprises:
    a leaf node determination unit, configured to determine the leaf node in which the new search key value is located;
    a locating unit, configured to search the data page corresponding to the leaf node for the first row of data whose key value is greater than the new search key value and, if such a row is found, to locate it in the data page of the leaf node;
    otherwise, to locate the first row of data greater than the new search key value in the data page corresponding to the right sibling of the leaf node.
  9. The apparatus according to claim 8, characterized in that the locating unit comprises:
    a locating subunit, configured to use in-page binary search (dichotomy) to locate, in the data page, the first row of data greater than the new search key value.
  10. The apparatus according to claim 7, characterized in that the intermediate locating module comprises:
    a locating unit, configured to generate a search key value according to the start condition and to locate, according to the structure of the B-tree, the first row of data greater than the search key value.
CN201711336936.1A 2017-12-14 2017-12-14 Method and device for generating deduplication data Active CN107944038B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711336936.1A CN107944038B (en) 2017-12-14 2017-12-14 Method and device for generating deduplication data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711336936.1A CN107944038B (en) 2017-12-14 2017-12-14 Method and device for generating deduplication data

Publications (2)

Publication Number Publication Date
CN107944038A (en) 2018-04-20
CN107944038B (en) 2020-11-10

Family

ID=61943242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711336936.1A Active CN107944038B (en) 2017-12-14 2017-12-14 Method and device for generating deduplication data

Country Status (1)

Country Link
CN (1) CN107944038B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6278992B1 (en) * 1997-03-19 2001-08-21 John Andrew Curtis Search engine using indexing method for storing and retrieving data
CN102609442A (en) * 2010-12-28 2012-07-25 微软公司 Adaptive Index for Data Deduplication
CN105528367A (en) * 2014-09-30 2016-04-27 华东师范大学 A method for storage and near-real time query of time-sensitive data based on open source big data
US20170060898A1 (en) * 2015-08-27 2017-03-02 Vmware, Inc. Fast file clone using copy-on-write b-tree
CN107003935A (en) * 2014-11-20 2017-08-01 国际商业机器公司 Optimize database duplicate removal
CN107391034A (en) * 2017-07-07 2017-11-24 华中科技大学 A kind of duplicate data detection method based on local optimization


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
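徐逸文等 (Xu Yiwen et al.): "一种处理B+树重复键值的方法" ("A method for handling duplicate key values in B+ trees"), 《计算机工程》 (Computer Engineering) *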

Also Published As

Publication number Publication date
CN107944038B (en) 2020-11-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant