CN104239301A - Data comparing method and device

Info

Publication number: CN104239301A
Application number: CN201310224623.2A
Authority: CN
Inventors: 刘祥斌; 夏晨; 杨少华
Original assignee: Alibaba Group Holding Ltd
Current assignee: Advanced New Technologies Co Ltd
Priority date: 2013-06-06
Filing date: 2013-06-06
Publication date: 2014-12-24
Anticipated expiration: 2033-06-06
Also published as: CN104239301B

Abstract

The invention provides a data comparison method. The method comprises: determining a first data set and a second data set to be compared, wherein each comparison object in the data sets comprises one or more comparison items; determining the types of the comparison items, wherein the types at least comprise a first-class comparison item and a non-first-class comparison item; and comparing the comparison objects in the first data set with the comparison objects in the second data set, wherein if a first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meets a preset condition, the first comparison object is judged to be consistent with the second comparison object. The invention further provides a data comparison device. Efficient data comparison can thereby be achieved.

Description

Data comparison method and device

Technical field

The present application relates to the field of data processing, and in particular to a data comparison method and device.

Background

In a typical cross-platform data migration scenario, when the computing kernel and the data storage format differ between the two platforms, the data usually has to be compared before and after the migration. If the data volume is small, it can be verified manually or with a common diff (difference comparison) tool, such as the diff command shipped with Linux systems. If the data to be compared is massive, for example several hundred million records, the comparison cannot be completed by manual verification or conventional tools alone.

The shortcomings of the prior art are mainly the following:

1) General-purpose tools are inefficient, and their time and resource cost is uncontrollable;

2) They run on common hardware platforms such as a single machine, whose hardware capability is insufficient.

Summary of the invention

The technical problem to be solved by the present application is to provide a data comparison method and device that improve the efficiency of comparing massive data.

To solve the above problem, the present application provides a data comparison method, comprising:

Determining a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set comprises one or more comparison items;

Determining the types of the comparison items, the types at least comprising a first-class comparison item and a non-first-class comparison item;

Comparing the comparison objects in the first data set with the comparison objects in the second data set, wherein if a first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meets a preset condition, the first comparison object is judged to be consistent with the second comparison object.

The above method may also have the following feature: comparing the comparison objects in the first data set with the comparison objects in the second data set comprises:

Extracting the first-class comparison items of each comparison object as the key value of that comparison object;

Grouping the comparison objects in the first data set by key value, and grouping the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;

Comparing the comparison objects in the first data set and the second data set group by group, wherein the two groups compared with each other have the same key value.
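The key extraction and grouping steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each comparison object is a Python dict, and the names `group_by_key`, `rows`, and `key_fields` are hypothetical.

```python
from collections import defaultdict


def group_by_key(rows, key_fields):
    """Group comparison objects (dicts) by the tuple of their first-class items."""
    groups = defaultdict(list)
    for row in rows:
        # The key value is the combination of all first-class comparison items.
        key = tuple(row[field] for field in key_fields)
        groups[key].append(row)
    return dict(groups)
```

Applying the same function to both data sets yields two mappings whose groups can then be compared key by key.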

The above method may also have the following feature: comparing the comparison objects in the first data set and the second data set group by group comprises:

Obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group;

Comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group;

If all non-first-class comparison items in the first group and the corresponding non-first-class comparison items in the second group meet the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, the difference between the two groups is reported.
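A minimal sketch of this group-level check is shown below. It assumes that each group has already been reduced to a list of floating-point values in corresponding order and that the preset condition is an absolute-difference threshold; `compare_groups` and the message texts are hypothetical.

```python
def compare_groups(group1, group2, threshold=1e-6):
    """Return None if the two groups are consistent, else a difference message.

    group1 and group2 hold the non-first-class values of two groups that
    share the same key value, listed in corresponding order (an assumption).
    """
    # The groups must contain the same number of comparison objects.
    if len(group1) != len(group2):
        return f"count mismatch: {len(group1)} vs {len(group2)}"
    # Every pair of corresponding values must meet the preset condition.
    for position, (x, y) in enumerate(zip(group1, group2)):
        if not abs(x - y) < threshold:
            return f"value mismatch at position {position}: {x} vs {y}"
    return None
```

Returning a message rather than a bare boolean is a design choice made here so that the difference can be reported, as the text requires.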

The above method may also have the following feature: before comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group, the method further comprises:

Sorting the comparison items, other than the key value, of the comparison objects in the first group, and sorting the comparison items, other than the key value, of the comparison objects in the second group.

The above method may also have the following feature: before comparing the comparison objects in the first data set and the second data set, the method further comprises: distributing the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.

The above method may also have the following feature: the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meeting the preset condition comprises:

For any non-first-class comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-class comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-class comparison item currently being compared;

Judging whether the difference or the ratio is less than a preset threshold; if so, the non-first-class comparison item currently being compared between the first comparison object and the second comparison object meets the preset condition.
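The two variants of the preset condition (difference or ratio below a threshold) could be expressed as in the sketch below. The function name `within_threshold`, the use of absolute values, and the handling of a zero denominator are assumptions made for illustration; the text does not specify them.

```python
def within_threshold(a, b, threshold, relative=False):
    """Check whether the difference (or its ratio to a) is below the threshold."""
    diff = abs(a - b)  # assumption: the magnitude of the difference is used
    if relative:
        if a == 0:
            # Assumption: the ratio is undefined when a is 0, so require equality.
            return diff == 0
        return diff / abs(a) < threshold
    return diff < threshold
```

For example, `within_threshold(99.99999999, 100.0, 1e-6, relative=True)` evaluates the ratio form, while omitting `relative` evaluates the plain difference.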

The above method may also have the following feature: the first-class comparison items comprise comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-class comparison items comprise comparison items whose data type is floating-point number.

The present application also provides a data comparison device, comprising:

A data set configuration module, configured to determine a first data set and a second data set to be compared, each comparison object in the first data set and the second data set comprising one or more comparison items, and to determine the types of the comparison items, the types at least comprising a first-class comparison item and a non-first-class comparison item;

A comparison module, configured to compare the comparison objects in the first data set with the comparison objects in the second data set, wherein if a first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meets a preset condition, the first comparison object is judged to be consistent with the second comparison object.

The above device may also have the following feature: the comparison module comprises:

A key value extraction unit, configured to extract the first-class comparison items of each comparison object as the key value of that comparison object;

A grouping unit, configured to group the comparison objects in the first data set by key value and to group the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;

A comparison unit, configured to compare the comparison objects in the first data set and the second data set group by group, wherein the two groups compared with each other have the same key value.

The above device may also have the following feature: the comparison unit comparing the comparison objects in the first data set and the second data set group by group comprises: obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group; comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group; if all non-first-class comparison items in the first group and the corresponding non-first-class comparison items in the second group meet the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, the difference between the two groups is reported.

The above device may also have the following feature: the comparison unit is further configured to, before comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group, sort the comparison items, other than the key value, of the comparison objects in the first group, and sort the comparison items, other than the key value, of the comparison objects in the second group.

The above device may also have the following feature: the comparison module further comprises a distribution module, configured to distribute the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.

The above device may also have the following feature: the comparison module judges whether the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meets the preset condition as follows:

For any non-first-class comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-class comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-class comparison item currently being compared; judging whether the difference or the ratio is less than a preset threshold; if so, the non-first-class comparison item currently being compared between the first comparison object and the second comparison object meets the preset condition.

The above device may also have the following feature: the first-class comparison items comprise comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-class comparison items comprise comparison items whose data type is floating-point number.

The present application has the following advantages:

1. Based on a cloud-computing distributed platform, efficient and fast data comparison can be achieved in a big-data context.

2. The data precision (exact or fuzzy) can be adjusted flexibly.

Of course, a product implementing the present application does not necessarily need to achieve all of the above advantages at the same time.

Brief description of the drawings

Fig. 1 is a flowchart of the data comparison method of Embodiment One of the present application;

Fig. 2 is a block diagram of the data comparison device of Embodiment Two of the present application.

Embodiments

To make the objects, technical solutions, and advantages of the present application clearer, embodiments of the application are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments of the application and the features of the embodiments may be combined with one another in any manner.

In addition, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.

Embodiment one

This embodiment provides a data comparison method which, as shown in Fig. 1, comprises:

Step 101: determining a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set comprises one or more comparison items;

Step 102: determining the types of the comparison items, the types at least comprising a first-class comparison item and a non-first-class comparison item;

Step 103: comparing the comparison objects in the first data set with the comparison objects in the second data set, wherein if a first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meets a preset condition, the first comparison object is judged to be consistent with the second comparison object.

In this embodiment, different comparison items are treated differently: some comparison items require exact matching, while others require only fuzzy matching. With this differentiated manner of comparison, the comparison requirement is relaxed for values whose precision is not critical, and values whose difference falls within the preset condition are also regarded as identical, which speeds up the comparison and improves comparison efficiency.
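The distinction between exact and fuzzy matching in steps 101 to 103 can be sketched as below. The sketch assumes that comparison objects are dicts, that the caller lists which fields are first-class (exact) and which are non-first-class (fuzzy), and that the preset condition is an absolute-difference threshold; all names are hypothetical.

```python
def objects_consistent(obj1, obj2, exact_fields, fuzzy_fields, threshold=1e-6):
    """Judge whether two comparison objects are consistent.

    exact_fields: first-class comparison items, which must be identical.
    fuzzy_fields: non-first-class items, whose difference must be below threshold.
    """
    # First-class comparison items require an exact match.
    if any(obj1[f] != obj2[f] for f in exact_fields):
        return False
    # Non-first-class comparison items only need to meet the preset condition.
    return all(abs(obj1[f] - obj2[f]) < threshold for f in fuzzy_fields)
```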

In this embodiment, if two data tables are compared, a comparison object can be a row of each table and the comparison items are the elements of that row. For example, in Table 1 of the subsequent example, each row is a comparison object, and the province, city, month, and sales volume in that row are the four comparison items of the comparison object.

The first-class and non-first-class comparison items can be set as required. For example, comparison items whose data type is character string and/or integer can be set as first-class comparison items, and comparison items whose data type is floating-point number can be set as non-first-class comparison items, for example as second-class comparison items. In actual use, a finer division is also possible: for example, comparison items whose data type is single-precision floating-point number are classed as first-class comparison items, and comparison items whose data type is double-precision floating-point number are classed as second-class comparison items. Alternatively, the division may follow the specific content of the comparison items rather than their data type. For example, a special comparison condition may be identified (a non-equivalence rule under which, say, a value of "Hangzhou" on one side and "Zhejiang" on the other are regarded as consistent). When the structure of the data fields can be judged manually in advance, the division need not follow the data type at all: some comparison items of the data set are directly designated as first-class comparison items and others as second-class comparison items. More types may also be defined, for example first-class, second-class, and third-class comparison items.

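In an alternative of this embodiment, in step 103, the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meeting the preset condition comprises:

For any non-first-class comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-class comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-class comparison item currently being compared;

Judging whether the difference or the ratio is less than a preset threshold; if so, the non-first-class comparison item currently being compared between the first comparison object and the second comparison object meets the preset condition.

The preset threshold above is only one example; other judgment methods may also be used. For example, for any non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object, a valid range is defined for each; as long as the non-first-class comparison items of both sides fall within their predefined valid ranges, the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object are considered to meet the preset condition.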
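The valid-range alternative might look like the following sketch. It assumes each valid range is a closed interval given as a `(low, high)` tuple, one per side; the name `in_valid_ranges` and the interval representation are hypothetical.

```python
def in_valid_ranges(a, b, range_a, range_b):
    """Check that both values fall within their predefined valid ranges.

    range_a and range_b are (low, high) tuples, treated as closed intervals
    (an assumption; the text does not say whether the bounds are inclusive).
    """
    low_a, high_a = range_a
    low_b, high_b = range_b
    return low_a <= a <= high_a and low_b <= b <= high_b
```

Unlike the threshold check, this condition does not look at the difference between the two values at all, only at whether each lies in its range.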

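In an alternative of this embodiment, step 103 specifically comprises:

Extracting the first-class comparison items of each comparison object as the key value of that comparison object;

Grouping the comparison objects in the first data set by key value, and grouping the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;

Comparing the comparison objects in the first data set and the second data set group by group, wherein the two groups compared with each other have the same key value.

In an alternative of this embodiment, comparing the comparison objects in the first data set and the second data set group by group comprises:

Obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group;

Comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group;

If all non-first-class comparison items in the first group and the corresponding non-first-class comparison items in the second group meet the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, the difference between the two groups is reported.

In an alternative of this embodiment, before comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group, the method may further comprise:

Sorting the comparison items, other than the key value, of the comparison objects in the first group, and sorting the comparison items, other than the key value, of the comparison objects in the second group.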

For example, the first group comprises comparison objects Ai, i = 1..4, and the second group comprises comparison objects Bj, j = 1..4. Ai comprises comparison items ai1, ai2, ai3, and ai4; Bj comprises comparison items bj1, bj2, bj3, and bj4. Here ai1, ai2, bj1, and bj2 are first-class comparison items, and ai3, ai4, bj3, and bj4 are second-class comparison items. The combination of ai1 and ai2 serves as the key value and is identical to the key value formed by bj1 and bj2.

When the first group and the second group are compared, it is first determined that the two groups contain the same number of comparison objects, namely 4; then only ai3, ai4 and bj3, bj4 need to be compared. To compare ai3 with bj3, the ai3 values are first sorted by numerical value and the bj3 values are sorted by numerical value, and the sorted ai3 and bj3 values are then compared. ai4 and bj4 are compared in the same way as ai3 and bj3.
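The sort-then-compare step for one column (such as ai3 against bj3) can be illustrated with the sketch below. The function name `sorted_columns_match` and the absolute-difference threshold are assumptions.

```python
def sorted_columns_match(col_a, col_b, threshold=1e-6):
    """Sort both columns by numerical value, then compare them pairwise."""
    if len(col_a) != len(col_b):
        return False
    # Sorting aligns the values so that row order in the source does not matter.
    return all(abs(x - y) < threshold
               for x, y in zip(sorted(col_a), sorted(col_b)))
```

With `col_a = [3.0, 1.0, 4.0, 2.0]` and `col_b = [2.0, 4.0000001, 1.0, 3.0]`, a position-by-position comparison of the unsorted lists would report differences, whereas the sorted comparison treats the two columns as consistent.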

In an alternative of this embodiment, the comparison of the first data set and the second data set is implemented in a cloud computing manner. Specifically, before the comparison objects in the first data set and the second data set are compared, the comparison objects in the first data set and the comparison objects in the second data set are distributed to instances according to the key value, based on a distributed computing framework, and the actual comparison of comparison objects is carried out within each instance. The specific distribution method can be set as required; for example, comparison objects with the same key value are distributed to the same instance. The comparison objects of one group can be distributed to a single instance or to multiple instances, and the comparison objects in one instance can be made to belong to the same group. Of course, one instance may also hold multiple groups: for example, if a group contains few comparison objects, then after that group is distributed to an instance, some comparison objects of other groups can also be distributed to that instance. The present application does not limit this.
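One possible distribution rule, hash partitioning by key value, is sketched below. It is only an illustration of "same key value, same instance"; the text does not prescribe a particular rule. The names `instance_for_key` and `distribute` are hypothetical, instances are modeled as plain lists, and `zlib.crc32` is used because Python's built-in `hash` of strings varies between runs.

```python
import zlib


def instance_for_key(key, num_instances):
    """Map a key value to an instance index with a run-stable hash."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_instances


def distribute(rows, key_fields, num_instances):
    """Distribute comparison objects so that equal keys share an instance."""
    instances = [[] for _ in range(num_instances)]
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        instances[instance_for_key(key, num_instances)].append(row)
    return instances
```

Because both data sets are partitioned with the same function, the groups that must be compared with each other end up at the same instance index. Several groups may share one instance, which the text explicitly allows.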

The present application is further illustrated below with an application example.

In this example, when the data sets are compared, they are distributed to different INSTANCEs (instances) using the MapReduce framework (a distributed computing framework) for distributed verification.

The schema corresponding to each data set is determined (the schema describes the data set); that is, the data is treated as a two-dimensional table and the type of each column is determined. In this example, the column types are character string, integer, and floating-point number.

Character string and integer columns are first-class comparison items (also called the "exact match" type); that is, if two rows of data are equal, their character string and integer columns must match exactly. Floating-point columns are second-class comparison items, which can also be considered equal when the error is within a certain range.
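Deriving the two classes from a schema could be done as in the sketch below. It assumes the schema is a mapping from column name to a Python type (`str`, `int`, or `float`), which is a stand-in for whatever schema representation the platform actually uses; `classify_columns` is a hypothetical name.

```python
def classify_columns(schema):
    """Split columns into exact-match (first-class) and fuzzy (second-class)."""
    exact, fuzzy = [], []
    for name, column_type in schema.items():
        if column_type in (str, int):
            exact.append(name)   # character string and integer: exact match
        elif column_type is float:
            fuzzy.append(name)   # floating-point number: compared with tolerance
        else:
            raise ValueError(f"unsupported column type for {name!r}")
    return exact, fuzzy
```

The exact-match columns returned here are the ones later used as the key when distributing the data.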

For the two rows of data in Table 1 below, the character string and integer columns match exactly and the difference between the floating-point numbers meets the preset condition, so the two rows can be considered equal. In practice a threshold can be specified, for example (f1-f2)/f1 < 0.000001, where f1 is the floating-point number in data set 1 and f2 is the floating-point number in data set 2.

Table 1

              Character string    Integer    Floating-point number
Data set 1    Hangzhou            100        99.99999999
Data set 2    Hangzhou            100        100.0000000

The character string and integer columns of the two data sets are identified and used as the key (key value) when distributing the data, and the floating-point numbers are used as the values to be compared. If the groups with the same key have the same number of rows and the floating-point differences are within the acceptance threshold, the data sets can be considered equal; otherwise, the difference is reported. If a group with the same key has multiple floating-point values, they are sorted before being compared. Take the comparison of the following two data sets as an example: Table 2 is data set 1 and Table 3 is data set 2.

Table 2

Table 3

Here the city, province, and month columns of the two tables serve as the grouping key, and the sales volume is the value to be compared.

Data with the same key can be distributed to the same INSTANCE on the same machine for comparison.

Tables 2 and 3 above can be logically divided by key into three groups, group1, group2, and group3, which are compared in different INSTANCEs distributed across different machines in the cluster. key1 and key2 in Table 4 below correspond to the key columns of Table 2 and Table 3, respectively.

Table 4

As can be seen from Table 4, group1 has two rows of data to compare: the second rows are exactly equal, and the difference between the floating-point numbers of the first rows is within the acceptable range, so they are also considered equal. The data of group2 match exactly. group3 has two rows from Table 2 but only one row from Table 3, so it can be reported that Table 3 is missing one row relative to Table 2 in this group.

Combining the comparison results of all groups, the conclusion is that the two tables differ. Specifically, Table 3 is missing one row in the group <NANJING, JIANGSU, 1>. This specific information also makes it easier for the business side to trace the root cause of the difference.
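The whole flow of this example (group by key, check row counts, sort, compare with a threshold, merge the per-group results) can be sketched in a single process as follows. This is not the MapReduce implementation: `compare_datasets`, the field names, and the message texts are hypothetical, and the preset condition is assumed to be an absolute-difference threshold.

```python
from collections import defaultdict


def compare_datasets(ds1, ds2, key_fields, value_field, threshold=1e-6):
    """Return a report mapping each differing key to a description."""
    def group(rows):
        grouped = defaultdict(list)
        for row in rows:
            grouped[tuple(row[f] for f in key_fields)].append(row[value_field])
        return grouped

    g1, g2 = group(ds1), group(ds2)
    report = {}
    for key in sorted(set(g1) | set(g2)):
        # Sort the values of each group before comparing them pairwise.
        v1, v2 = sorted(g1.get(key, [])), sorted(g2.get(key, []))
        if len(v1) != len(v2):
            report[key] = f"row count differs: {len(v1)} vs {len(v2)}"
        elif any(abs(x - y) >= threshold for x, y in zip(v1, v2)):
            report[key] = "floating-point values differ beyond threshold"
    return report
```

An empty report means the two data sets are considered equal. When one data set has two rows for a key such as `("NANJING", "JIANGSU", 1)` and the other has only one, the report contains that key with a row-count message, which mirrors the kind of group-level finding described above.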

Because a distributed comparison method is used, large data volumes can be compared easily; in testing, two tables of more than ten billion rows were compared.

In the above application example, if there are multiple floating-point columns, they can be compared column by column. An advantage of distributed computing, however, is that the floating-point fields can be split out, each combined with the key field into its own group, and compared in parallel; the comparison results are then merged. For example, given four fields A, B, C, and D, where A is the key field and B, C, and D are floating-point fields, AB, AC, and AD can be compared separately at the same time, and the results are then merged.
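The AB + AC + AD split can be illustrated with the sketch below, in which a local thread pool stands in for the distributed instances. This substitution, the names `compare_column` and `compare_split`, and the absolute-difference threshold are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def compare_column(rows1, rows2, key_field, value_field, threshold=1e-6):
    """Compare one floating-point column of two data sets, grouped by key."""
    def collect(rows):
        grouped = {}
        for row in rows:
            grouped.setdefault(row[key_field], []).append(row[value_field])
        return {key: sorted(values) for key, values in grouped.items()}

    g1, g2 = collect(rows1), collect(rows2)
    if g1.keys() != g2.keys():
        return False
    return all(
        len(g1[k]) == len(g2[k])
        and all(abs(x - y) < threshold for x, y in zip(g1[k], g2[k]))
        for k in g1
    )


def compare_split(rows1, rows2, key_field, value_fields):
    """Compare each (key, value column) pair in parallel and merge the results."""
    with ThreadPoolExecutor() as pool:
        futures = {
            field: pool.submit(compare_column, rows1, rows2, key_field, field)
            for field in value_fields
        }
        return {field: future.result() for field, future in futures.items()}
```

Calling `compare_split(rows1, rows2, "A", ["B", "C", "D"])` returns one boolean per floating-point field. Note that comparing columns independently checks each column's values per key but not how values are paired within a row.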

In addition, when there are multiple floating-point columns, the comparison process sorts first and compares afterward. The sorting proceeds field by field, deepening the sort one field at a time. For example, with fields A, B, C, and D, where A is the primary key, B is sorted first, C is sorted within the order of B, and D is then sorted in turn. The comparison is performed after the sorting is complete.
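Sorting by B, then by C within equal B, then by D corresponds to a lexicographic sort on the tuple (B, C, D). A minimal sketch, assuming the rows of one key group are dicts and using the hypothetical name `sort_rows`:

```python
def sort_rows(rows, fields=("B", "C", "D")):
    """Sort the rows of one key group by B, then C within B, then D."""
    # Python compares tuples element by element, which gives the nested order.
    return sorted(rows, key=lambda row: tuple(row[f] for f in fields))
```

After both groups have been sorted this way, the rows can be compared position by position.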

Embodiment two

This embodiment provides a data comparison device which, as shown in Fig. 2, comprises:

A data set configuration module 201, configured to determine a first data set and a second data set to be compared, each comparison object in the first data set and the second data set comprising one or more comparison items, and to determine the types of the comparison items, the types at least comprising a first-class comparison item and a non-first-class comparison item;

A comparison module 202, configured to compare the comparison objects in the first data set with the comparison objects in the second data set, wherein if a first-class comparison item of a first comparison object in the first data set is identical to the corresponding first-class comparison item of a second comparison object in the second data set, and the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meets a preset condition, the first comparison object is judged to be consistent with the second comparison object.

In an alternative of this embodiment, the comparison module 202 comprises:

A key value extraction unit 2021, configured to extract the first-class comparison items of each comparison object as the key value of that comparison object;

A grouping unit 2022, configured to group the comparison objects in the first data set by key value and to group the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;

A comparison unit 2023, configured to compare the comparison objects in the first data set and the second data set group by group, wherein the two groups compared with each other have the same key value.

In an alternative of this embodiment, the comparison unit 2023 comparing the comparison objects in the first data set and the second data set group by group comprises: obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group; comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group; if all non-first-class comparison items in the first group and the corresponding non-first-class comparison items in the second group meet the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, the difference between the two groups is reported.

In an alternative of this embodiment, the comparison unit 2023 is further configured to, before comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group, sort the comparison items, other than the key value, of the comparison objects in the first group, and sort the comparison items, other than the key value, of the comparison objects in the second group.

In an alternative of this embodiment, the comparison module 202 further comprises a distribution unit 2024, configured to distribute the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.

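In an alternative of this embodiment, the comparison module 202 judges whether the difference between a non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meets the preset condition as follows:

For any non-first-class comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-class comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-class comparison item currently being compared; judging whether the difference or the ratio is less than a preset threshold; if so, the non-first-class comparison item currently being compared between the first comparison object and the second comparison object meets the preset condition.

In an alternative of this embodiment, the first-class comparison items comprise comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-class comparison items comprise comparison items whose data type is floating-point number.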

Those of ordinary skill in the art will understand that all or part of the steps of the above method can be completed by a program instructing the relevant hardware, and the program can be stored in a computer-readable storage medium such as a read-only memory (ROM), a magnetic disk, or an optical disc. Alternatively, all or part of the steps of the above embodiments can be implemented with one or more integrated circuits. Accordingly, each module or unit in the above embodiments can be implemented in the form of hardware or in the form of a software function module. The present application is not limited to any particular combination of hardware and software.

Claims

1. a data comparison method, is characterized in that, comprising:

2. the method for claim 1, is characterized in that, described comparing to the comparison other of described first data centralization and the comparison other of the second data centralization comprises:

3. method as claimed in claim 2, is characterized in that, described being compared by the comparison other of described first data set and the second data centralization according to grouping comprises:

4. method as claimed in claim 3, is characterized in that,

The described non-first kind by each comparison other in described first group compares the corresponding non-first kind in item and described second group and compares before item compares, and also comprises:

5. the method as described in as arbitrary in claim 2 to 4, it is characterized in that, described method also comprises: also comprise before comparing to the comparison other of described first data set and the second data centralization: be distributed in each example based on distributed computing framework according to described key assignments by the comparison other of the comparison other of described first data centralization and described second data centralization.

6. The method of any one of claims 1 to 4, characterized in that the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meeting the preset condition comprises:

7. The method of any one of claims 1 to 4, characterized in that the first-class comparison items comprise comparison items whose data type is character string and/or comparison items whose data type is integer; and the non-first-class comparison items comprise comparison items whose data type is floating-point number.

8. A data comparison device, characterized by comprising:

9. The device of claim 8, characterized in that the comparison module comprises:

10. The device of claim 9, characterized in that the comparing unit comparing, by group, the comparison objects in the first data set and the second data set comprises: obtaining, according to the grouping, a first group from the first data set and a second group from the second data set, the key value of each comparison object in the first group being identical to the key value of each comparison object in the second group; comparing the non-first-class comparison items of each comparison object in the first group with the corresponding non-first-class comparison items in the second group; and, if all non-first-class comparison items in the first group and the corresponding non-first-class comparison items in the second group meet the preset condition, and the first group and the second group contain the same number of comparison objects, determining that the first group is consistent with the second group, and otherwise reporting that the two groups differ.

11. The device of claim 10, characterized in that the comparing unit is further configured to, before the non-first-class comparison items of each comparison object in the first group are compared with the corresponding non-first-class comparison items in the second group, sort the comparison items, other than the key value, of each comparison object in the first group, and sort the comparison items, other than the key value, of each comparison object in the second group.

12. The device of any one of claims 9 to 11, characterized in that the comparison module further comprises a distribution module configured to distribute, based on a distributed computing framework, the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value.

13. The device of any one of claims 8 to 11, characterized in that the comparison module determines, in the following manner, whether the difference between the non-first-class comparison item of the first comparison object and the corresponding non-first-class comparison item of the second comparison object meets the preset condition:

14. The device of any one of claims 8 to 11, characterized in that the first-class comparison items comprise comparison items whose data type is character string and/or comparison items whose data type is integer; and the non-first-class comparison items comprise comparison items whose data type is floating-point number.
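The group-wise comparison recited in the claims above can be sketched as follows. This is an illustrative, single-process sketch under several assumptions, not the claimed implementation: the key value is taken to be the tuple of string and integer items of a comparison object, the preset condition is assumed to be an absolute tolerance, and the sorting step is read as ordering the objects within a group by their non-key items. All names (`split_items`, `group_by_key`, `groups_consistent`, `compare_data_sets`, `tolerance`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Record = Tuple  # one comparison object: a tuple of comparison items


def split_items(record: Record) -> Tuple[Tuple, Tuple[float, ...]]:
    """Split a record into first-class items (str/int) and float items."""
    key = tuple(v for v in record if isinstance(v, (str, int)))
    floats = tuple(v for v in record if isinstance(v, float))
    return key, floats


def group_by_key(records: Iterable[Record]) -> Dict[Tuple, List[Tuple[float, ...]]]:
    """Group the float items of all records that share the same key value."""
    groups: Dict[Tuple, List[Tuple[float, ...]]] = defaultdict(list)
    for record in records:
        key, floats = split_items(record)
        groups[key].append(floats)
    return groups


def groups_consistent(first: List[Tuple[float, ...]],
                      second: List[Tuple[float, ...]],
                      tolerance: float) -> bool:
    """Two groups agree if they have equal size and all paired floats are close."""
    if len(first) != len(second):
        return False
    # Sort both groups so that objects are paired in the same order.
    for a, b in zip(sorted(first), sorted(second)):
        if len(a) != len(b):
            return False
        # Assumed preset condition: absolute difference within the tolerance.
        if any(abs(x - y) > tolerance for x, y in zip(a, b)):
            return False
    return True


def compare_data_sets(first: Iterable[Record],
                      second: Iterable[Record],
                      tolerance: float = 1e-6) -> Set[Tuple]:
    """Return the key values whose groups differ between the two data sets."""
    g1, g2 = group_by_key(first), group_by_key(second)
    return {
        key for key in g1.keys() | g2.keys()
        if not groups_consistent(g1.get(key, []), g2.get(key, []), tolerance)
    }
```

A short usage example: with `first = [("a", 1, 1.0, 2.0), ("b", 2, 5.0)]` and `second = [("a", 1, 1.0, 2.0), ("b", 2, 5.5)]`, calling `compare_data_sets(first, second, tolerance=1e-3)` reports only the key `("b", 2)`. In a distributed computing framework, the grouping step would roughly correspond to routing objects with the same key value to the same instance. Pairing objects by sorted order is a simplification that can mispair objects whose values differ only within the tolerance.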