CN104239301A - Data comparing method and device - Google Patents

Data comparing method and device

Info

Publication number
CN104239301A
Authority
CN
China
Prior art keywords
comparison
item
group
compares
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310224623.2A
Other languages
Chinese (zh)
Other versions
CN104239301B (en)
Inventor
刘祥斌
夏晨
杨少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201310224623.2A
Publication of CN104239301A
Application granted
Publication of CN104239301B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a data comparing method. The method comprises: determining a first data set and a second data set to be compared, wherein each comparison object in the data sets comprises one or more comparison items; determining the type of each comparison item, wherein the types at least comprise a first-class comparison item and a non-first-class comparison item; and comparing the comparison objects in the first data set with the comparison objects in the second data set, wherein, if the first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object satisfies a preset condition, the first comparison object is judged to be consistent with the second comparison object. The invention further provides a data comparing device. Efficient data comparison can thereby be realized.

Description

Data comparison method and device
Technical field
The present application relates to the field of data processing, and in particular to a data comparison method and device.
Background
In a typical cross-platform data migration scenario, when the computing kernel and the data storage format of the two platforms change, the data before and after the migration usually needs to be compared. If the data volume is small, it can be verified manually, or with a common DIFF (difference comparison) tool such as the diff command that ships with LINUX systems. If the data to be compared is massive, for example when hundreds of millions of records need to be compared, the task cannot be completed by manual verification or conventional tools alone.
The shortcomings of the prior art mainly lie in the following aspects:
1) General-purpose tools are inefficient, and their time and resource cost cannot be controlled;
2) They run on ordinary hardware platforms such as single machines, whose hardware capacity is insufficient to support the task.
Summary of the invention
The technical problem to be solved by the present application is to provide a data comparison method and device that improve the efficiency of comparing massive data.
To solve the above problem, the present application provides a data comparison method, comprising:
Determining a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set comprises one or more comparison items;
Determining the type of each comparison item, the types at least comprising: a first-class comparison item and a non-first-class comparison item;
Comparing the comparison objects in the first data set with the comparison objects in the second data set, wherein, if the first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object satisfies a preset condition, the first comparison object is judged to be consistent with the second comparison object.
The above method may also have the following feature: comparing the comparison objects in the first data set with the comparison objects in the second data set comprises:
Extracting the first-class comparison item of each comparison object as the key value of that comparison object;
Grouping the comparison objects in the first data set by key value, and grouping the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;
Comparing the comparison objects in the first data set and the second data set group by group, wherein the comparison objects in the two groups compared with each other have the same key value.
The above method may also have the following feature: comparing the comparison objects in the first data set and the second data set group by group comprises:
Obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group;
Comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group;
If all non-first-class comparison items in the first group and the corresponding non-first-class comparison items in the second group satisfy the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, a difference between the two groups is reported.
The above method may also have the following feature: before the non-first-class comparison items of each comparison object in the first group are compared with the corresponding non-first-class comparison items in the second group, the method further comprises:
Sorting the comparison items other than the key value of the comparison objects in the first group, and sorting the comparison items other than the key value of the comparison objects in the second group.
The above method may also have the following feature: before the comparison objects in the first data set and the second data set are compared, the method further comprises: distributing the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.
The above method may also have the following feature: the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object satisfying the preset condition comprises:
For any non-first-class comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-class comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-class comparison item currently being compared;
Judging whether the difference or the ratio is less than a preset threshold; if so, the non-first-class comparison items of the first comparison object and the second comparison object currently being compared satisfy the preset condition.
The above method may also have the following feature: the first-class comparison item comprises comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-class comparison item comprises comparison items whose data type is floating-point number.
The present application also provides a data comparison device, comprising:
A data set configuration module, configured to determine a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set comprises one or more comparison items, and to determine the type of each comparison item, the types at least comprising: a first-class comparison item and a non-first-class comparison item;
A comparison module, configured to compare the comparison objects in the first data set with the comparison objects in the second data set, wherein, if the first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object satisfies a preset condition, the first comparison object is judged to be consistent with the second comparison object.
The above device may also have the following feature: the comparison module comprises:
A key value extraction unit, configured to extract the first-class comparison item of each comparison object as the key value of that comparison object;
A grouping unit, configured to group the comparison objects in the first data set by key value and to group the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;
A comparison unit, configured to compare the comparison objects in the first data set and the second data set group by group, wherein the comparison objects in the two groups compared with each other have the same key value.
The above device may also have the following feature: the comparison unit comparing the comparison objects in the first data set and the second data set group by group comprises: obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group; comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group; if all non-first-class comparison items in the first group and the corresponding non-first-class comparison items in the second group satisfy the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, a difference between the two groups is reported.
The above device may also have the following feature: the comparison unit is further configured to, before comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group, sort the comparison items other than the key value of the comparison objects in the first group and sort the comparison items other than the key value of the comparison objects in the second group.
The above device may also have the following feature: the comparison module further comprises a distribution module, configured to distribute the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.
The above device may also have the following feature: the comparison module judges whether the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object satisfies the preset condition as follows:
For any non-first-class comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-class comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-class comparison item currently being compared; judging whether the difference or the ratio is less than a preset threshold; if so, the non-first-class comparison items of the first comparison object and the second comparison object currently being compared satisfy the preset condition.
The above device may also have the following feature: the first-class comparison item comprises comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-class comparison item comprises comparison items whose data type is floating-point number.
The present application has the following advantages:
1. Based on a distributed cloud computing platform, efficient and fast data comparison can be achieved at big-data scale.
2. The data precision (exact or fuzzy) can be adjusted flexibly.
Of course, a product implementing the present application does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
Fig. 1 is a flow chart of the data comparison method of Embodiment one of the present application;
Fig. 2 is a block diagram of the data comparison device of Embodiment two of the present application.
Detailed description
To make the objects, technical solutions and advantages of the present application clearer, embodiments of the application are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the application and the features of the embodiments may be combined with one another arbitrarily.
In addition, although a logical order is shown in the flow chart, in some cases the steps shown or described may be performed in an order different from the one given here.
Embodiment one
This embodiment provides a data comparison method which, as shown in Fig. 1, comprises:
Step 101: determining a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set comprises one or more comparison items;
Step 102: determining the type of each comparison item, the types at least comprising: a first-class comparison item and a non-first-class comparison item;
Step 103: comparing the comparison objects in the first data set with the comparison objects in the second data set, wherein, if the first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object satisfies a preset condition, the first comparison object is judged to be consistent with the second comparison object.
In this embodiment, different comparison items are treated differently: some require exact matching, while others only require fuzzy matching. With this differentiated comparison, when the precision requirement for some values is not high, the comparison requirement is relaxed and values whose difference satisfies the preset condition are also considered identical, which speeds up the comparison and improves comparison efficiency.
In this embodiment, if two data tables are compared, a comparison object can be a row of a data table, and the comparison items are the elements of that row. For example, in Table 1 of the subsequent example, each row is one comparison object, and the province, city, month and sales volume in the row are the 4 comparison items of that comparison object.
The first-class and non-first-class comparison items can be set as required. For example, comparison items whose data type is character string and/or integer can be set as first-class comparison items, and comparison items whose data type is floating-point number can be set as non-first-class comparison items, for example as second-class comparison items. In actual use, a finer division is also possible: for example, comparison items whose data type is single-precision floating-point number are classified as first-class comparison items, and those whose data type is double-precision floating-point number as second-class comparison items. Alternatively, the division need not follow the data type and can instead follow the specific content of the comparison items, for example a special identification condition (non-equivalent identification: if the value on one side is "Hangzhou" and the value on the other side is "Zhejiang", the two are regarded as consistent). When the structure of the data fields can be judged manually in advance, the division need not follow the data type either: some comparison items in the data set can be directly designated as first-class comparison items and others as second-class comparison items. More types can also be defined, for example first-class, second-class and third-class comparison items, and so on.
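The type-based division described above can be sketched as follows. This is only an illustrative sketch: the function name `classify_columns`, the schema representation as a dict of Python types, and the column names are assumptions made here, not part of the application.

```python
from typing import Dict, List, Tuple


def classify_columns(schema: Dict[str, type]) -> Tuple[List[str], List[str]]:
    """Split column names into first-class (exact match) and
    non-first-class (fuzzy match) comparison items by data type."""
    first_class: List[str] = []
    non_first_class: List[str] = []
    for name, col_type in schema.items():
        if col_type in (str, int):
            # Strings and integers must match exactly.
            first_class.append(name)
        elif col_type is float:
            # Floating-point values are compared with a tolerance.
            non_first_class.append(name)
        else:
            raise ValueError(f"unsupported column type for {name!r}: {col_type}")
    return first_class, non_first_class


# Hypothetical schema for a row with province, city, month and sales volume.
schema = {"province": str, "city": str, "month": int, "sales": float}
exact_cols, fuzzy_cols = classify_columns(schema)
print(exact_cols)  # ['province', 'city', 'month']
print(fuzzy_cols)  # ['sales']
```

A content-based division, as the text also allows, would replace the type test with an explicit per-column assignment.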
In an optional variant of this embodiment, in step 103, the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object satisfying the preset condition comprises:
For any non-first-class comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-class comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-class comparison item currently being compared;
Judging whether the difference or the ratio is less than a preset threshold; if so, the non-first-class comparison items of the first comparison object and the second comparison object currently being compared satisfy the preset condition.
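A minimal sketch of this threshold test is given below. The function name `within_threshold`, the default threshold, the use of absolute values, and the handling of a zero reference value are all assumptions made for illustration; the text only states that the difference or the ratio is compared against a preset threshold.

```python
def within_threshold(a: float, b: float, threshold: float = 1e-6,
                     relative: bool = False) -> bool:
    """Return True if a and b satisfy the preset condition.

    relative=False: test the difference |a - b| against the threshold.
    relative=True:  test the ratio |a - b| / |a| against the threshold.
    """
    diff = abs(a - b)  # assumption: the magnitude of the difference is used
    if not relative:
        return diff < threshold
    if a == 0:
        # Assumption: with a zero reference value the ratio is undefined,
        # so only an exact match is accepted.
        return diff == 0
    return diff / abs(a) < threshold


print(within_threshold(99.99999999, 100.0))                # True
print(within_threshold(1000.0, 1000.0005))                 # False
print(within_threshold(1000.0, 1000.0005, relative=True))  # True
```

The last two calls show why the choice between difference and ratio matters: the same pair of values fails the absolute test but passes the relative one.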
The preset threshold above is only one example; other ways of judging can also be adopted. For example, for any non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object, a valid range is defined for each; as long as the non-first-class comparison items of both sides fall within their predefined valid ranges, the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object are considered to satisfy the preset condition.
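The range-based alternative can be sketched in the same way. The name `both_in_valid_ranges`, the closed-interval representation and the example bounds are hypothetical; the text does not specify how the valid ranges are expressed.

```python
from typing import Tuple

Range = Tuple[float, float]  # (lower bound, upper bound), both inclusive


def both_in_valid_ranges(a: float, b: float,
                         range_a: Range, range_b: Range) -> bool:
    """Return True if each value falls inside its own predefined range."""
    return range_a[0] <= a <= range_a[1] and range_b[0] <= b <= range_b[1]


# Hypothetical ranges: both sides are accepted anywhere between 99 and 101.
print(both_in_valid_ranges(99.5, 100.8, (99.0, 101.0), (99.0, 101.0)))  # True
print(both_in_valid_ranges(99.5, 101.5, (99.0, 101.0), (99.0, 101.0)))  # False
```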
In an optional variant of this embodiment, step 103 specifically comprises:
Extracting the first-class comparison item of each comparison object as the key value of that comparison object;
Grouping the comparison objects in the first data set by key value, and grouping the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;
Comparing the comparison objects in the first data set and the second data set group by group, wherein the comparison objects in the two groups compared with each other have the same key value.
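The key extraction and grouping steps above might look like this in outline. The row representation (a dict per comparison object), the function name `group_by_key` and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def group_by_key(rows: List[Row], key_cols: Sequence[str]) -> Dict[Tuple, List[Row]]:
    """Group rows by the tuple of their first-class comparison items."""
    groups: Dict[Tuple, List[Row]] = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in key_cols)  # the key value of this row
        groups[key].append(row)
    return dict(groups)


# Hypothetical rows; only the column roles follow the text.
rows = [
    {"city": "HANGZHOU", "month": 1, "sales": 10.5},
    {"city": "HANGZHOU", "month": 1, "sales": 20.0},
    {"city": "NANJING", "month": 1, "sales": 7.25},
]
groups = group_by_key(rows, ["city", "month"])
print(sorted(groups))                  # [('HANGZHOU', 1), ('NANJING', 1)]
print(len(groups[("HANGZHOU", 1)]))    # 2
```

Applying the same function to both data sets yields two dictionaries whose entries with equal keys are the groups to be compared with each other.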
In an optional variant of this embodiment, comparing the comparison objects in the first data set and the second data set group by group comprises:
Obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group;
Comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group;
If all non-first-class comparison items in the first group and the corresponding non-first-class comparison items in the second group satisfy the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, a difference between the two groups is reported.
In an optional variant of this embodiment, before the non-first-class comparison items of each comparison object in the first group are compared with the corresponding non-first-class comparison items in the second group, the method may further comprise:
Sorting the comparison items other than the key value of the comparison objects in the first group, and sorting the comparison items other than the key value of the comparison objects in the second group.
For example, the first group comprises comparison objects Ai, i = 1...4, and the second group comprises comparison objects Bj, j = 1...4. Ai comprises comparison items ai1, ai2, ai3, ai4; Bj comprises comparison items bj1, bj2, bj3, bj4. ai1, ai2, bj1, bj2 are first-class comparison items, and ai3, ai4, bj3, bj4 are second-class comparison items. The combination of ai1 and ai2 serves as the key value and is identical to the key value formed by bj1 and bj2.
When the first group is compared with the second group, it is first determined that the two groups contain the same number of comparison objects, namely 4, so only ai3, ai4 and bj3, bj4 need to be compared. To compare ai3 with bj3, the ai3 values are first sorted by numerical value and the bj3 values are sorted by numerical value, and ai3 and bj3 are then compared after sorting. ai4 and bj4 are compared in the same way as ai3 and bj3.
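The count check followed by sort-then-compare can be sketched for one second-class column (for instance the ai3 and bj3 values of the example). The function name `groups_consistent`, the absolute-difference test and the sample values are assumptions made here.

```python
from typing import List


def groups_consistent(first: List[float], second: List[float],
                      threshold: float = 1e-6) -> bool:
    """Compare the values of one non-first-class column in two groups
    that share the same key value."""
    if len(first) != len(second):
        # Different numbers of comparison objects: the groups differ.
        return False
    # Sort both sides by numerical value, then compare position by position.
    return all(abs(a - b) < threshold
               for a, b in zip(sorted(first), sorted(second)))


a3 = [3.0, 1.0, 2.0, 4.0]        # hypothetical ai3 values, i = 1..4
b3 = [4.0, 3.0000001, 2.0, 1.0]  # hypothetical bj3 values, j = 1..4
print(groups_consistent(a3, b3))       # True
print(groups_consistent(a3, b3[:3]))   # False
```

Sorting removes the dependence on row order inside a group, which is not guaranteed to be the same on both platforms.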
In an optional variant of this embodiment, the comparison of the first data set and the second data set is implemented in a cloud computing manner. Specifically, before the comparison objects in the first data set and the second data set are compared, the comparison objects in the first data set and the comparison objects in the second data set are distributed to instances according to the key value, based on a distributed computing framework, and the actual comparison of the comparison objects is carried out within each instance. The distribution scheme can be set as required; for example, comparison objects with the same key value are distributed to the same instance. The comparison objects of one group can be distributed to a single instance or to multiple instances, and the comparison objects in one instance can be made to belong to the same group. Of course, one instance can also hold multiple groups: for example, if a group contains few comparison objects, then after that group has been distributed to an instance, some comparison objects of other groups can also be distributed to that instance. The present application places no limitation on this.
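The "same key value, same instance" scheme can be illustrated with a simple hash partition. This sketch does not use a real distributed framework; it only simulates the assignment step in one process, and the names `instance_for_key` and `partition`, as well as the choice of CRC32 as the hash, are assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

KeyedRow = Tuple[Tuple, float]  # (key value, value to compare)


def instance_for_key(key: Tuple, num_instances: int) -> int:
    """Map a key value to an instance index with a stable hash, so that
    equal keys from both data sets land on the same instance."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_instances


def partition(rows: List[KeyedRow], num_instances: int) -> Dict[int, List[KeyedRow]]:
    """Assign each row to an instance according to its key value."""
    buckets: Dict[int, List[KeyedRow]] = defaultdict(list)
    for key, value in rows:
        buckets[instance_for_key(key, num_instances)].append((key, value))
    return dict(buckets)


key = ("HANGZHOU", "ZHEJIANG", 1)
# The same key always maps to the same instance, whichever data set it is from.
print(instance_for_key(key, 4) == instance_for_key(key, 4))  # True
```

Because several keys can hash to the same index, one instance may hold multiple groups, which matches the last case described above.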
The present application is further illustrated below by an application example.
In this example, when the data sets are compared, they are distributed by the MapReduce framework (a distributed computing framework) to different INSTANCEs (instances) for distributed verification.
The schema corresponding to each data set is determined (the schema describes a data set); that is, the data is treated as a two-dimensional table and the type of each column is determined. In this example, the column types are character string, integer and floating-point number.
Character strings and integers are first-class comparison items (also called the "exact match" type); that is, if two rows of data are equal, their string and integer columns (including integer types) must match exactly. Floating-point numbers are second-class comparison items, and can also be considered equal if the error is within a certain range.
For the two rows of data in Table 1 below, the string and integer columns match exactly and the difference between the floating-point numbers satisfies the preset condition, so the two rows can be considered equal. In practice a threshold can be specified; for example, the floating-point numbers are considered equal if (f1-f2)/f1 < 0.000001, where f1 is the floating-point number in data set 1 and f2 is the floating-point number in data set 2.
Table 1
              Character string   Integer   Floating-point number
Data set 1    Hangzhou           100       99.99999999
Data set 2    Hangzhou           100       100.0000000
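As a worked check of the threshold on the Table 1 values (f1 = 99.99999999, f2 = 100.0000000), the ratio can be evaluated directly. Taking the absolute value of the ratio is an assumption added here: as written, the condition (f1-f2)/f1 < 0.000001 would be satisfied by any negative ratio, however large its magnitude.

```latex
\frac{f_1 - f_2}{f_1}
  = \frac{99.99999999 - 100.0000000}{99.99999999}
  \approx -1.0 \times 10^{-10},
\qquad
\left| \frac{f_1 - f_2}{f_1} \right| \approx 1.0 \times 10^{-10} < 10^{-6}
```

Under either reading the two floating-point values in Table 1 fall within the threshold.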
The string and integer columns of the two data sets are extracted and used as the key (key value) when distributing the data, and the floating-point numbers are the values to be compared. If the groups with the same key have the same number of rows and the floating-point differences are within the acceptance threshold, the data sets can be considered equal; otherwise, the difference is reported. If a group with the same key has multiple floating-point values, they are sorted before being compared. Take the comparison of the following two data sets as an example: Table 2 is data set 1 and Table 3 is data set 2, and the two need to be compared.
Table 2
Table 3
In both tables, the city, province and month columns serve as the grouping key, and the sales volume serves as the value to be compared.
Data with the same key can be distributed to an INSTANCE on the same machine for comparison.
Table 2 and Table 3 above can be logically divided into three groups by key: group1, group2 and group3, which are compared in different INSTANCEs distributed across different machines in the cluster. key1 and key2 in Table 4 below correspond to the key columns of Table 2 and Table 3, respectively.
Table 4
As can be seen from the table above, group1 has two rows of data to be compared: the second row is exactly equal, and the floating-point difference of the first row is within the acceptable range, so it is also considered equal. The data of group2 matches exactly. group3 has two rows from Table 2 but only one row from Table 3, so it can be reported that Table 3 is missing one row relative to Table 2 in this group.
Combining the comparison results of all groups, the conclusion is that the two tables differ; specifically, Table 3 lacks one row in the group <NANJING, JIANGSU, 1>. This specific information also makes it easier for the business side to trace the root cause of the difference.
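The group-by-group comparison and the resulting report can be sketched end to end as follows. The contents of Tables 2 to 4 are not reproduced in this text, so every row and sales value below is hypothetical; only the key <NANJING, JIANGSU, 1> and the shape of the outcome (two rows versus one row in that group) follow the description. The function name `compare_datasets` and the report wording are also assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

KeyedRow = Tuple[Tuple, float]  # ((city, province, month), sales volume)


def compare_datasets(rows1: List[KeyedRow], rows2: List[KeyedRow],
                     threshold: float = 1e-6) -> Dict[Tuple, str]:
    """Return a report mapping each differing key to a description."""
    g1, g2 = defaultdict(list), defaultdict(list)
    for key, value in rows1:
        g1[key].append(value)
    for key, value in rows2:
        g2[key].append(value)
    report: Dict[Tuple, str] = {}
    for key in sorted(set(g1) | set(g2)):
        v1, v2 = sorted(g1.get(key, [])), sorted(g2.get(key, []))
        if len(v1) != len(v2):
            report[key] = f"row count differs: {len(v1)} vs {len(v2)}"
        elif any(abs(a - b) >= threshold for a, b in zip(v1, v2)):
            report[key] = "value difference exceeds threshold"
    return report


# Hypothetical stand-ins for Table 2 (data set 1) and Table 3 (data set 2).
table2 = [(("HANGZHOU", "ZHEJIANG", 1), 100.0), (("HANGZHOU", "ZHEJIANG", 1), 200.0),
          (("SHANGHAI", "SHANGHAI", 1), 300.0),
          (("NANJING", "JIANGSU", 1), 400.0), (("NANJING", "JIANGSU", 1), 500.0)]
table3 = [(("HANGZHOU", "ZHEJIANG", 1), 100.0000001), (("HANGZHOU", "ZHEJIANG", 1), 200.0),
          (("SHANGHAI", "SHANGHAI", 1), 300.0),
          (("NANJING", "JIANGSU", 1), 400.0)]
print(compare_datasets(table2, table3))
# {('NANJING', 'JIANGSU', 1): 'row count differs: 2 vs 1'}
```

Reporting per key, rather than a single pass/fail flag, is what lets the difference be traced back to a specific group.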
Because a distributed comparison method is used, large data volumes can be compared with ease; in testing, two tables of more than ten billion rows were compared.
In the application example above, if there are multiple floating-point columns, they can be compared column by column. An advantage of distributed computing, however, is that the floating-point fields can be split out, each combined with the key field into its own group, compared in parallel, and the comparison results then merged. For example, with four fields A, B, C, D, where A is the key field and B, C, D are floating-point fields, AB, AC and AD can be compared separately at the same time and the results then merged.
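The AB + AC + AD split can be sketched with a thread pool standing in for the distributed instances. This is a single-process illustration only; the function names `column_consistent` and `compare_by_column`, the per-column boolean result and the sample rows are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence

Row = Dict[str, object]


def column_consistent(rows1: List[Row], rows2: List[Row], key: str, col: str,
                      threshold: float = 1e-6) -> bool:
    """Compare one (key, floating-point column) pair across both data sets."""
    left = sorted((r[key], r[col]) for r in rows1)
    right = sorted((r[key], r[col]) for r in rows2)
    if len(left) != len(right):
        return False
    return all(k1 == k2 and abs(v1 - v2) < threshold
               for (k1, v1), (k2, v2) in zip(left, right))


def compare_by_column(rows1: List[Row], rows2: List[Row], key: str = "A",
                      cols: Sequence[str] = ("B", "C", "D")) -> Dict[str, bool]:
    """Run the per-column comparisons concurrently and merge the results."""
    with ThreadPoolExecutor() as pool:
        futures = {col: pool.submit(column_consistent, rows1, rows2, key, col)
                   for col in cols}
        return {col: future.result() for col, future in futures.items()}


rows1 = [{"A": "k1", "B": 1.0, "C": 2.0, "D": 3.0}]
rows2 = [{"A": "k1", "B": 1.0, "C": 2.5, "D": 3.0}]
print(compare_by_column(rows1, rows2))  # {'B': True, 'C': False, 'D': True}
```

The merged result shows which floating-point field caused a mismatch, at the cost of no longer comparing rows as whole units.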
In addition, when there are multiple floating-point columns, the comparison sorts first and compares afterwards. The sorting proceeds field by field, deepening the sort with each field. For example, with fields A, B, C, D where A is the primary key, B is sorted first, C is sorted within the ordering of B, and D is then sorted in turn. The comparison is performed after sorting is complete.
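Sorting by B, then by C within equal B, then by D is a lexicographic sort on the tuple (B, C, D), which can be sketched as follows. The function name `sort_group` and the sample rows are hypothetical.

```python
from typing import Dict, List, Sequence

Row = Dict[str, object]


def sort_group(rows: List[Row], fields: Sequence[str] = ("B", "C", "D")) -> List[Row]:
    """Sort the rows of one group field by field: B first, then C within
    equal B, then D within equal B and C."""
    return sorted(rows, key=lambda row: tuple(row[f] for f in fields))


# Hypothetical rows of one group; all share the same primary key A.
group = [
    {"A": "k1", "B": 2.0, "C": 1.0, "D": 9.0},
    {"A": "k1", "B": 1.0, "C": 5.0, "D": 3.0},
    {"A": "k1", "B": 1.0, "C": 5.0, "D": 2.0},
    {"A": "k1", "B": 1.0, "C": 4.0, "D": 8.0},
]
for row in sort_group(group):
    print(row["B"], row["C"], row["D"])
# 1.0 4.0 8.0
# 1.0 5.0 2.0
# 1.0 5.0 3.0
# 2.0 1.0 9.0
```

After both groups are sorted this way, rows can be compared position by position with the tolerance test.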
Embodiment two
The present embodiment provides a kind of comparing device, as shown in Figure 2, comprising:
Data set configuration module 201, for determining the first data set to be compared and the second data set, and each comparison other of described first data set and the second data centralization comprises and one or morely compares item; And determine the described type comparing item, described type at least comprises: the first kind compares item and the non-first kind compares item;
Comparing module 202, for comparing to the comparison other of described first data centralization and the comparison other of the second data centralization, wherein, if the first kind of the first comparison other of described first data centralization compares item and the corresponding first kind in the second comparison other of described second data centralization, to compare item identical, and the non-first kind of described first comparison other compares item meets pre-conditioned with the corresponding non-first kind difference compared between item of described second comparison other, then judge that described first comparison other is consistent with described second comparison other.
In an alternative of this embodiment, the comparison module 202 comprises:
A key extraction unit 2021, configured to extract the first-type comparison items of each comparison object as the key of that comparison object;
A grouping unit 2022, configured to group the comparison objects in the first data set by key and to group the comparison objects in the second data set by key, wherein the comparison objects within one group have the same key;
A comparison unit 2023, configured to compare the comparison objects in the first data set and the second data set group by group, wherein the comparison objects in two groups that are compared with each other have the same key.
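Key extraction and grouping can be sketched as below. The tuple-valued key and the function names `key_of` and `group_by_key` are assumptions chosen for illustration; the application does not prescribe a particular data structure.

```python
from collections import defaultdict


def key_of(obj, exact_fields):
    """Build the key of a comparison object from its first-type items."""
    return tuple(obj[f] for f in exact_fields)


def group_by_key(dataset, exact_fields):
    """Group comparison objects so that every object in a group has the same key."""
    groups = defaultdict(list)
    for obj in dataset:
        groups[key_of(obj, exact_fields)].append(obj)
    return dict(groups)


dataset = [
    {"id": 1, "name": "a", "v": 1.0},
    {"id": 1, "name": "a", "v": 2.0},
    {"id": 2, "name": "b", "v": 3.0},
]
groups = group_by_key(dataset, ["id", "name"])
```

Grouping both data sets with the same `exact_fields` means that groups with equal keys can be looked up and compared directly.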
In an alternative of this embodiment, the comparison unit 2023 comparing the comparison objects in the first data set and the second data set group by group comprises: obtaining, according to the grouping, a first group from the first data set and a second group from the second data set, the key of each comparison object in the first group being identical to the key of each comparison object in the second group; comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group; and, if all non-first-type comparison items in the first group and the corresponding non-first-type comparison items in the second group satisfy the preset condition, and the first group and the second group contain the same number of comparison objects, determining that the first group is consistent with the second group, and otherwise reporting that the two groups differ.
In an alternative of this embodiment, the comparison unit 2023 is further configured to, before comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group, sort the comparison items other than the key of the comparison objects in the first group, and sort the comparison items other than the key of the comparison objects in the second group.
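The group-level comparison, including the count check and the preliminary sort, might look like the following sketch. It reads the sorting step as ordering the objects of each group by their non-key items, and it assumes an absolute-difference threshold as the preset condition; the function names and the shape of the grouped data (dictionaries mapping a key to a list of objects) are hypothetical.

```python
def groups_consistent(first_group, second_group, fuzzy_fields, threshold=1e-6):
    """Compare two groups whose objects share the same key."""
    # Groups of different sizes can never be consistent.
    if len(first_group) != len(second_group):
        return False

    def order(obj):
        # Sort by the non-key items, field by field.
        return tuple(obj[f] for f in fuzzy_fields)

    pairs = zip(sorted(first_group, key=order), sorted(second_group, key=order))
    # Every aligned pair must be within the threshold on every non-key item.
    return all(
        abs(a[f] - b[f]) < threshold for a, b in pairs for f in fuzzy_fields
    )


def report_differences(first_groups, second_groups, fuzzy_fields, threshold=1e-6):
    """Return the keys whose groups differ between the two data sets."""
    differing = []
    for key in sorted(set(first_groups) | set(second_groups)):
        first = first_groups.get(key, [])
        second = second_groups.get(key, [])
        if not groups_consistent(first, second, fuzzy_fields, threshold):
            differing.append(key)
    return differing


first_groups = {("k1",): [{"v": 1.0}, {"v": 2.0}], ("k2",): [{"v": 5.0}]}
second_groups = {
    ("k1",): [{"v": 2.0000001}, {"v": 1.0}],
    ("k2",): [{"v": 5.5}],
    ("k3",): [{"v": 0.0}],
}
differing = report_differences(first_groups, second_groups, ["v"])
```

In this example the `k1` groups are consistent despite the different input order, `k2` differs in value, and `k3` is missing from the first data set, so only the last two keys are reported.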
In an alternative of this embodiment, the comparison module 202 further comprises a distribution unit 2024, configured to distribute the comparison objects in the first data set and the comparison objects in the second data set, according to their keys, to the instances of a distributed computing framework.
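The effect of key-based distribution can be sketched in a single process as follows. This passage does not name the distributed computing framework, so the sketch simply simulates instances as list partitions; the CRC32-based routing and the names `instance_for` and `distribute` are assumptions for illustration only.

```python
import zlib


def instance_for(key, num_instances):
    """Map a key to an instance index with a hash that is stable across processes."""
    # Python's built-in hash() is randomized per process for strings,
    # so a CRC32 of the key's textual form is used instead.
    return zlib.crc32(repr(key).encode("utf-8")) % num_instances


def distribute(dataset, exact_fields, num_instances):
    """Split a data set into per-instance partitions according to each object's key."""
    partitions = [[] for _ in range(num_instances)]
    for obj in dataset:
        key = tuple(obj[f] for f in exact_fields)
        partitions[instance_for(key, num_instances)].append(obj)
    return partitions


first = [{"id": "k1", "v": 1.0}, {"id": "k2", "v": 2.0}]
second = [{"id": "k2", "v": 2.0}, {"id": "k1", "v": 1.0}]
first_parts = distribute(first, ["id"], 4)
second_parts = distribute(second, ["id"], 4)
```

Because both data sets are routed by the same function, objects with the same key always land on the same instance, so each instance can compare its groups independently.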
In an alternative of this embodiment, the comparison module 202 determines in the following manner whether the differences between the non-first-type comparison items of the first comparison object and the corresponding non-first-type comparison items of the second comparison object satisfy the preset condition:
For any non-first-type comparison item of the first comparison object, obtain the difference between it and the corresponding non-first-type comparison item of the second comparison object, or obtain the ratio of that difference to the non-first-type comparison item currently being compared; determine whether the difference or the ratio is less than a preset threshold and, if so, the non-first-type comparison items of the first comparison object and the second comparison object currently being compared satisfy the preset condition.
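The two forms of the preset condition, absolute difference and ratio, can be sketched as below. The text does not state which of the two values serves as the denominator of the ratio or how a zero value is handled; this sketch assumes the first object's value is the reference and treats a zero reference as matching only an equal value. The function name and parameters are hypothetical.

```python
def within_tolerance(a, b, threshold, relative=False):
    """Check whether two floating-point items satisfy the preset condition.

    relative=False: |a - b| < threshold
    relative=True:  |a - b| / |a| < threshold (a is the assumed reference value)
    """
    diff = abs(a - b)
    if not relative:
        return diff < threshold
    if a == 0:
        # Assumption: with a zero reference the ratio is undefined,
        # so only an identical value is accepted.
        return diff == 0
    return diff / abs(a) < threshold
```

For example, the values 100.0 and 100.5 differ by 0.5 in absolute terms and by a ratio of 0.005, so they pass an absolute threshold of 1.0 or a relative threshold of 0.01 and fail tighter thresholds of 0.001.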
In an alternative of this embodiment, the first-type comparison items comprise comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-type comparison items comprise comparison items whose data type is floating-point number.
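A simple way to derive the two item types from a sample object is sketched below. Treating every non-float item as first-type is an assumption that goes slightly beyond the text, which only names strings and integers; the function name `classify_fields` is hypothetical.

```python
def classify_fields(sample):
    """Split the item names of a sample object into (first_type, non_first_type)."""
    first_type, non_first_type = [], []
    for name, value in sample.items():
        if isinstance(value, float):
            # Floating-point items are compared with a tolerance.
            non_first_type.append(name)
        else:
            # Strings and integers (and, by assumption, anything else) are compared exactly.
            first_type.append(name)
    return first_type, non_first_type


exact_fields, fuzzy_fields = classify_fields({"id": 1, "name": "x", "amount": 1.5})
```

The resulting lists can then serve as the key fields for grouping and as the fields checked against the threshold.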
The present application has the following advantages:
1. Being based on a cloud-computing distributed platform, it enables efficient, fast data comparison at big-data scale.
2. The precision of the data comparison (exact or fuzzy) can be adjusted flexibly.
Of course, a product implementing the present application does not necessarily need to achieve all of the above advantages at the same time.
A person of ordinary skill in the art will understand that all or some of the steps of the above method may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk or an optical disc. Alternatively, all or some of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of a software function module. The present application is not limited to any particular combination of hardware and software.

Claims (14)

1. A data comparison method, characterized by comprising:
determining a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set comprises one or more comparison items;
determining the type of each comparison item, the types comprising at least first-type comparison items and non-first-type comparison items;
comparing the comparison objects in the first data set with the comparison objects in the second data set, wherein, if the first-type comparison items of a first comparison object in the first data set are identical to the corresponding first-type comparison items of a second comparison object in the second data set, and the differences between the non-first-type comparison items of the first comparison object and the corresponding non-first-type comparison items of the second comparison object satisfy a preset condition, the first comparison object is judged to be consistent with the second comparison object.
2. the method for claim 1, is characterized in that, described comparing to the comparison other of described first data centralization and the comparison other of the second data centralization comprises:
The first kind extracting each comparison other compares item, as the key assignments of this comparison other;
The comparison other of described first data centralization is divided into groups by described key assignments, the comparison other of described second data centralization is divided into groups by described key assignments; Wherein, its key assignments of comparison other in same group is identical;
According to grouping, the comparison other of described first data set and the second data centralization is compared, and the key assignments of comparison other in mutually compare two groups is identical.
3. method as claimed in claim 2, is characterized in that, described being compared by the comparison other of described first data set and the second data centralization according to grouping comprises:
Obtain first group according to described grouping from described first data centralization, obtain second group from described second data centralization, and the key assignments of each comparison other in described first group is identical with the key assignments of each comparison other in described second group;
The non-first kind of each comparison other in described first group is compared the corresponding non-first kind in item and described second group to compare item and compare;
If all non-first kind in described first group compares item and compares item with the corresponding non-first kind in described second group and meet described pre-conditioned, and described first group identical with the number of the comparison other in described second group, then described first group with described second group consistent, otherwise, report two groups of differences.
4. method as claimed in claim 3, is characterized in that,
The described non-first kind by each comparison other in described first group compares the corresponding non-first kind in item and described second group and compares before item compares, and also comprises:
The comparison item of each comparison other in described first group except described key assignments is sorted, the comparison item of each comparison other in described second group except described key assignments is sorted.
5. The method of any one of claims 2 to 4, characterized in that, before comparing the comparison objects in the first data set and the second data set, the method further comprises: distributing the comparison objects in the first data set and the comparison objects in the second data set, according to their keys, to the instances of a distributed computing framework.
6. The method of any one of claims 1 to 4, characterized in that the differences between the non-first-type comparison items of the first comparison object and the corresponding non-first-type comparison items of the second comparison object satisfying the preset condition comprises:
for any non-first-type comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-type comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-type comparison item currently being compared;
determining whether the difference or the ratio is less than a preset threshold and, if so, determining that the non-first-type comparison items of the first comparison object and the second comparison object currently being compared satisfy the preset condition.
7. The method of any one of claims 1 to 4, characterized in that the first-type comparison items comprise comparison items whose data type is character string and/or comparison items whose data type is integer; and the non-first-type comparison items comprise comparison items whose data type is floating-point number.
8. A data comparison device, characterized by comprising:
a data set configuration module, configured to determine a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set comprises one or more comparison items, and to determine the type of each comparison item, the types comprising at least first-type comparison items and non-first-type comparison items;
a comparison module, configured to compare the comparison objects in the first data set with the comparison objects in the second data set, wherein, if the first-type comparison items of a first comparison object in the first data set are identical to the corresponding first-type comparison items of a second comparison object in the second data set, and the differences between the non-first-type comparison items of the first comparison object and the corresponding non-first-type comparison items of the second comparison object satisfy a preset condition, the first comparison object is judged to be consistent with the second comparison object.
9. The device of claim 8, characterized in that the comparison module comprises:
a key extraction unit, configured to extract the first-type comparison items of each comparison object as the key of that comparison object;
a grouping unit, configured to group the comparison objects in the first data set by key and to group the comparison objects in the second data set by key, wherein the comparison objects within one group have the same key;
a comparison unit, configured to compare the comparison objects in the first data set and the second data set group by group, wherein the comparison objects in two groups that are compared with each other have the same key.
10. The device of claim 9, characterized in that the comparison unit comparing the comparison objects in the first data set and the second data set group by group comprises: obtaining, according to the grouping, a first group from the first data set and a second group from the second data set, the key of each comparison object in the first group being identical to the key of each comparison object in the second group; comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group; and, if all non-first-type comparison items in the first group and the corresponding non-first-type comparison items in the second group satisfy the preset condition, and the first group and the second group contain the same number of comparison objects, determining that the first group is consistent with the second group, and otherwise reporting that the two groups differ.
11. The device of claim 10, characterized in that the comparison unit is further configured to, before comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group, sort the comparison items other than the key of the comparison objects in the first group, and sort the comparison items other than the key of the comparison objects in the second group.
12. The device of any one of claims 9 to 11, characterized in that the comparison module further comprises a distribution module, configured to distribute the comparison objects in the first data set and the comparison objects in the second data set, according to their keys, to the instances of a distributed computing framework.
13. The device of any one of claims 8 to 11, characterized in that the comparison module determines in the following manner whether the differences between the non-first-type comparison items of the first comparison object and the corresponding non-first-type comparison items of the second comparison object satisfy the preset condition:
for any non-first-type comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-type comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-type comparison item currently being compared; and determining whether the difference or the ratio is less than a preset threshold and, if so, determining that the non-first-type comparison items of the first comparison object and the second comparison object currently being compared satisfy the preset condition.
14. The device of any one of claims 8 to 11, characterized in that the first-type comparison items comprise comparison items whose data type is character string and/or comparison items whose data type is integer; and the non-first-type comparison items comprise comparison items whose data type is floating-point number.
CN201310224623.2A 2013-06-06 2013-06-06 A kind of data comparison method and device Active CN104239301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310224623.2A CN104239301B (en) 2013-06-06 2013-06-06 A kind of data comparison method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310224623.2A CN104239301B (en) 2013-06-06 2013-06-06 A kind of data comparison method and device

Publications (2)

Publication Number Publication Date
CN104239301A (en) 2014-12-24
CN104239301B (en) 2018-02-13

Family

ID=52227395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310224623.2A Active CN104239301B (en) 2013-06-06 2013-06-06 A kind of data comparison method and device

Country Status (1)

Country Link
CN (1) CN104239301B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572921A (en) * 2014-12-27 2015-04-29 北京奇虎科技有限公司 Cross-datacenter data synchronization method and device
CN104778179A (en) * 2014-01-14 2015-07-15 阿里巴巴集团控股有限公司 Data migration test method and system
CN105677645A (en) * 2014-11-17 2016-06-15 阿里巴巴集团控股有限公司 Data sheet comparison method and device
CN105988889A (en) * 2015-02-11 2016-10-05 阿里巴巴集团控股有限公司 Data check method and apparatus
CN105989089A (en) * 2015-02-12 2016-10-05 阿里巴巴集团控股有限公司 Data comparison method and device
CN106202134A (en) * 2015-05-30 2016-12-07 中国石油化工股份有限公司 Data redundancy inspection method
CN106372668A (en) * 2016-08-31 2017-02-01 新浪网技术(中国)有限公司 Data matching method and device
CN107291672A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 The treating method and apparatus of tables of data
CN108228560A (en) * 2016-12-22 2018-06-29 北京国双科技有限公司 A kind of determining method and device of data type
CN108399151A (en) * 2017-02-06 2018-08-14 百度在线网络技术(北京)有限公司 Comparing system and method
CN108681559A (en) * 2018-04-11 2018-10-19 广东电网有限责任公司 A kind of comparison method and system based on multisystem data application
CN109783697A (en) * 2018-12-14 2019-05-21 北京海数宝科技有限公司 Data processing method, device, computer equipment and storage medium
CN111563073A (en) * 2020-04-20 2020-08-21 杭州市质量技术监督检测院 NQI information sharing method, platform, server and readable storage medium
CN112711683A (en) * 2021-02-25 2021-04-27 浙江口碑网络技术有限公司 Data comparison method and device and computer equipment
CN112905602A (en) * 2021-03-26 2021-06-04 掌阅科技股份有限公司 Data comparison method, computing device and computer storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1472917A (en) * 2002-07-30 2004-02-04 上海阿尔卡特网络支援系统有限公司 Program controlled switcher database corresponding system
CN101887436A (en) * 2009-05-12 2010-11-17 阿里巴巴集团控股有限公司 Retrieval method, device and system
US20110238621A1 (en) * 2010-03-29 2011-09-29 Commvault Systems, Inc. Systems and methods for selective data replication
CN102411588A (en) * 2010-09-26 2012-04-11 金蝶软件(中国)有限公司 Comparison checking method and system of data table
CN102831127A (en) * 2011-06-17 2012-12-19 阿里巴巴集团控股有限公司 Method, device and system for processing repeating data
CN102880650A (en) * 2012-08-27 2013-01-16 中国工商银行股份有限公司 Data matching method and device
US20130311498A1 (en) * 2012-05-05 2013-11-21 Blackbaud, Inc. Systems, methods, and computer program products for data integration and data mapping

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778179A (en) * 2014-01-14 2015-07-15 阿里巴巴集团控股有限公司 Data migration test method and system
CN104778179B (en) * 2014-01-14 2019-05-28 阿里巴巴集团控股有限公司 A kind of Data Migration test method and system
CN105677645B (en) * 2014-11-17 2018-12-21 阿里巴巴集团控股有限公司 A kind of tables of data comparison method and device
CN105677645A (en) * 2014-11-17 2016-06-15 阿里巴巴集团控股有限公司 Data sheet comparison method and device
CN104572921A (en) * 2014-12-27 2015-04-29 北京奇虎科技有限公司 Cross-datacenter data synchronization method and device
CN104572921B (en) * 2014-12-27 2017-12-19 北京奇虎科技有限公司 A kind of method of data synchronization and device across data center
CN105988889A (en) * 2015-02-11 2016-10-05 阿里巴巴集团控股有限公司 Data check method and apparatus
CN105988889B (en) * 2015-02-11 2019-06-14 阿里巴巴集团控股有限公司 A kind of data verification method and device
CN105989089A (en) * 2015-02-12 2016-10-05 阿里巴巴集团控股有限公司 Data comparison method and device
CN106202134A (en) * 2015-05-30 2016-12-07 中国石油化工股份有限公司 Data redundancy inspection method
CN107291672A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 The treating method and apparatus of tables of data
CN106372668A (en) * 2016-08-31 2017-02-01 新浪网技术(中国)有限公司 Data matching method and device
CN108228560A (en) * 2016-12-22 2018-06-29 北京国双科技有限公司 A kind of determining method and device of data type
CN108399151A (en) * 2017-02-06 2018-08-14 百度在线网络技术(北京)有限公司 Comparing system and method
CN108399151B (en) * 2017-02-06 2022-02-15 百度在线网络技术(北京)有限公司 Data comparison system and method
CN108681559A (en) * 2018-04-11 2018-10-19 广东电网有限责任公司 A kind of comparison method and system based on multisystem data application
CN108681559B (en) * 2018-04-11 2020-09-25 广东电网有限责任公司 Comparison method and system based on multi-system data application
CN109783697A (en) * 2018-12-14 2019-05-21 北京海数宝科技有限公司 Data processing method, device, computer equipment and storage medium
CN109783697B (en) * 2018-12-14 2021-04-27 北京海数宝科技有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111563073A (en) * 2020-04-20 2020-08-21 杭州市质量技术监督检测院 NQI information sharing method, platform, server and readable storage medium
CN112711683A (en) * 2021-02-25 2021-04-27 浙江口碑网络技术有限公司 Data comparison method and device and computer equipment
CN112905602A (en) * 2021-03-26 2021-06-04 掌阅科技股份有限公司 Data comparison method, computing device and computer storage medium

Also Published As

Publication number Publication date
CN104239301B (en) 2018-02-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191210

Address after: P.O. Box 31119, grand exhibition hall, hibiscus street, 802 West Bay Road, Grand Cayman, Cayman Islands

Patentee after: Innovative advanced technology Co., Ltd

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: Alibaba Group Holding Co., Ltd.
