CN104239301B - A kind of data comparison method and device - Google Patents

Info

Publication number
CN104239301B
Authority
CN
China
Prior art keywords
comparison
item
data set
group
compares
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310224623.2A
Other languages
Chinese (zh)
Other versions
CN104239301A (en
Inventor
刘祥斌
夏晨
杨少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201310224623.2A priority Critical patent/CN104239301B/en
Publication of CN104239301A publication Critical patent/CN104239301A/en
Application granted granted Critical
Publication of CN104239301B publication Critical patent/CN104239301B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application provides a data comparison method, including: determining a first data set and a second data set to be compared, where each comparison object in the data sets includes one or more comparison items; determining the type of each comparison item, the types including at least a first-type comparison item and a non-first-type comparison item; and comparing the comparison objects in the first data set with the comparison objects in the second data set, where, if the first-type comparison item of a first comparison object in the first data set is identical to the corresponding first-type comparison item of a second comparison object in the second data set, and the difference between the non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object satisfies a preset condition, the first comparison object and the second comparison object are judged to be consistent. The application also provides a data comparison device. The application enables efficient data comparison.

Description

Data comparison method and device
Technical field
The application relates to the field of data processing, and in particular to a data comparison method and device.
Background art
In common cross-platform data migration scenarios, when the computing kernel and the data storage format of the two platforms change, the data usually has to be compared before and after migration. If the data volume is small, it can be verified manually or with a common DIFF (difference comparison) tool, such as the diff command that ships with Linux systems. If the data to be compared is massive, for example hundreds of millions of records, manual verification or conventional tools cannot complete the task.
The shortcomings of the prior art are mainly the following:
1) general-purpose tools are inefficient, and their time and resource costs are uncontrollable;
2) they run on ordinary hardware platforms, such as a single machine, whose hardware capability is insufficient.
Summary of the invention
The technical problem to be solved by the application is to provide a data comparison method and device that improve the efficiency of comparing massive data.
To solve the above problem, the application provides a data comparison method, including:
determining a first data set and a second data set to be compared, where each comparison object in the first data set and the second data set includes one or more comparison items;
determining the type of each comparison item, the types including at least a first-type comparison item and a non-first-type comparison item;
comparing the comparison objects in the first data set with the comparison objects in the second data set, where, if the first-type comparison item of a first comparison object in the first data set is identical to the corresponding first-type comparison item of a second comparison object in the second data set, and the difference between the non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object satisfies a preset condition, the first comparison object and the second comparison object are judged to be consistent.
The above method may also have the following feature: comparing the comparison objects in the first data set with the comparison objects in the second data set includes:
extracting the first-type comparison item of each comparison object as the key value of that comparison object;
grouping the comparison objects in the first data set by key value, and grouping the comparison objects in the second data set by key value, where the comparison objects in the same group have the same key value;
comparing the comparison objects in the first data set and the second data set group by group, where the comparison objects in the two groups being compared have the same key value.
The above method may also have the following feature: comparing the comparison objects in the first data set and the second data set group by group includes:
obtaining a first group from the first data set and a second group from the second data set according to the grouping, where the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group;
comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group;
if every non-first-type comparison item in the first group and the corresponding non-first-type comparison item in the second group satisfy the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, the difference between the two groups is reported.
The above method may also have the following feature: before the non-first-type comparison items of each comparison object in the first group are compared with the corresponding non-first-type comparison items in the second group, the method further includes:
sorting the comparison items other than the key value of the comparison objects in the first group, and sorting the comparison items other than the key value of the comparison objects in the second group.
The above method may also have the following feature: before the comparison objects in the first data set and the second data set are compared, the method further includes distributing the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.
The above method may also have the following feature: the difference between the non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object satisfying the preset condition includes:
for any non-first-type comparison item of the first comparison object, obtaining its difference from the corresponding non-first-type comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-type comparison item currently being compared;
judging whether the difference or the ratio is less than a preset threshold, and if so, the non-first-type comparison items currently being compared in the first comparison object and the second comparison object satisfy the preset condition.
The above method may also have the following feature: the first-type comparison item includes a comparison item whose data type is character string and/or a comparison item whose data type is integer; the non-first-type comparison item includes a comparison item whose data type is floating-point number.
The application also provides a data comparison device, including:
a data set configuration module, configured to determine a first data set and a second data set to be compared, where each comparison object in the first data set and the second data set includes one or more comparison items, and to determine the type of each comparison item, the types including at least a first-type comparison item and a non-first-type comparison item;
a comparison module, configured to compare the comparison objects in the first data set with the comparison objects in the second data set, where, if the first-type comparison item of a first comparison object in the first data set is identical to the corresponding first-type comparison item of a second comparison object in the second data set, and the difference between the non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object satisfies a preset condition, the first comparison object and the second comparison object are judged to be consistent.
The above device may also have the following feature: the comparison module includes:
a key value extraction unit, configured to extract the first-type comparison item of each comparison object as the key value of that comparison object;
a grouping unit, configured to group the comparison objects in the first data set by key value and to group the comparison objects in the second data set by key value, where the comparison objects in the same group have the same key value;
a comparing unit, configured to compare the comparison objects in the first data set and the second data set group by group, where the comparison objects in the two groups being compared have the same key value.
The above device may also have the following feature: the comparing unit comparing the comparison objects in the first data set and the second data set group by group includes: obtaining a first group from the first data set and a second group from the second data set according to the grouping, where the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group; comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group; and, if every non-first-type comparison item in the first group and the corresponding non-first-type comparison item in the second group satisfy the preset condition, and the first group and the second group contain the same number of comparison objects, judging that the first group is consistent with the second group, and otherwise reporting the difference between the two groups.
The above device may also have the following feature: the comparing unit is further configured to, before comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group, sort the comparison items other than the key value of the comparison objects in the first group and sort the comparison items other than the key value of the comparison objects in the second group.
The above device may also have the following feature: the comparison module further includes a distribution module, configured to distribute the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.
The above device may also have the following feature: the comparison module judges, in the following manner, whether the difference between the non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object satisfies the preset condition:
for any non-first-type comparison item of the first comparison object, obtaining its difference from the corresponding non-first-type comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-type comparison item currently being compared; and judging whether the difference or the ratio is less than a preset threshold, and if so, the non-first-type comparison items currently being compared in the first comparison object and the second comparison object satisfy the preset condition.
The above device may also have the following feature: the first-type comparison item includes a comparison item whose data type is character string and/or a comparison item whose data type is integer; the non-first-type comparison item includes a comparison item whose data type is floating-point number.
The application has the following advantages:
1. Based on a cloud computing distributed platform, efficient and fast data comparison can be achieved in a big-data setting.
2. Data precision can be adjusted flexibly (exact or fuzzy).
Of course, a product implementing the application does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
Fig. 1 is a flow chart of the data comparison method of Embodiment one of the application;
Fig. 2 is a block diagram of the data comparison device of Embodiment two of the application.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the application clearer, the embodiments of the application are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments in the application and the features in the embodiments can be combined with one another.
In addition, although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one given here.
Embodiment one
This embodiment provides a data comparison method which, as shown in Fig. 1, includes:
Step 101, determining a first data set and a second data set to be compared, where each comparison object in the first data set and the second data set includes one or more comparison items;
Step 102, determining the type of each comparison item, the types including at least a first-type comparison item and a non-first-type comparison item;
Step 103, comparing the comparison objects in the first data set with the comparison objects in the second data set, where, if the first-type comparison item of a first comparison object in the first data set is identical to the corresponding first-type comparison item of a second comparison object in the second data set, and the difference between the non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object satisfies a preset condition, the first comparison object and the second comparison object are judged to be consistent.
In this embodiment, different comparison items are distinguished: some comparison items require exact matching, while others need only fuzzy matching. With this differentiated way of comparing, the comparison requirement is relaxed where numerical precision requirements are not high, and values whose difference falls within the preset condition are also regarded as identical, which speeds up comparison and improves comparison efficiency.
In this embodiment, if two data tables are compared, a comparison object can be a row in each data table, and the comparison items are the elements of that row. For example, as in Table 1 of the subsequent example, each row is a comparison object, and the province, city, month, and sales volume in the row are the 4 comparison items of that comparison object.
The first-type and non-first-type comparison items can be set as needed. For example, comparison items whose data type is character string and/or comparison items whose data type is integer can be set as first-type comparison items, and comparison items whose data type is floating-point number can be set as non-first-type comparison items, for example second-type comparison items. In actual use, a finer division is possible; for example, comparison items whose data type is single-precision floating-point can be classed as first-type comparison items and comparison items whose data type is double-precision floating-point as second-type comparison items. Alternatively, the division need not follow data type and can instead follow the specific content of the comparison items. For example, where a special comparison condition applies (values recognized as matching without being equal, such as "Hangzhou" on one side and "Zhejiang" on the other being regarded as consistent) and the structure of the data fields can be judged manually in advance, some comparison items in the data set can be directly designated first-type comparison items and others second-type comparison items. More types can also be defined, such as first-type, second-type, and third-type comparison items.
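The default division by data type described above can be sketched in Python as follows. This is a minimal illustration only; the function name `classify_item` and the labels `"first"` and `"non_first"` are assumptions made for this sketch and do not come from the application.

```python
def classify_item(value):
    """Classify one comparison item by its data type (illustrative helper).

    Strings and integers are treated as first-type (exact-match) items,
    floating-point numbers as non-first-type (tolerance-based) items.
    """
    if isinstance(value, (str, int)):
        return "first"
    if isinstance(value, float):
        return "non_first"
    raise TypeError(f"unsupported comparison item type: {type(value).__name__}")


# One comparison object (a table row) with three comparison items.
row = ("Hangzhou", 100, 99.99999999)
types = [classify_item(v) for v in row]
print(types)
```

A content-based division, as the paragraph above allows, could replace the `isinstance` checks with an explicit per-column designation.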
In an alternative of this embodiment, in step 103, the difference between the non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object satisfying the preset condition includes:
for any non-first-type comparison item of the first comparison object, obtaining its difference from the corresponding non-first-type comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-type comparison item currently being compared;
judging whether the difference or the ratio is less than a preset threshold, and if so, the non-first-type comparison items currently being compared in the first comparison object and the second comparison object satisfy the preset condition.
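The difference test and the ratio test can be sketched as one small Python function. The name `within_threshold`, the use of absolute values, the choice of the first value as the denominator, and the handling of a zero denominator are all assumptions of this sketch; the text above only states that the difference or the ratio is compared with a preset threshold.

```python
def within_threshold(a, b, threshold, relative=False):
    """Return True if a and b are close enough under the preset threshold.

    relative=False compares the absolute difference with the threshold;
    relative=True compares the difference divided by a (the item from the
    first comparison object) with the threshold.
    """
    diff = abs(a - b)
    if not relative:
        return diff < threshold
    if a == 0:
        # Assumption: with a zero denominator, only an exact match passes.
        return diff == 0
    return diff / abs(a) < threshold


print(within_threshold(100.0, 100.4, 0.5))                   # absolute test
print(within_threshold(100.0, 100.4, 0.001, relative=True))  # relative test
```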
The preset threshold above is only one example; other ways of judging can also be used. For example, for any non-first-type comparison item in the first comparison object and its corresponding non-first-type comparison item in the second comparison object, a valid range is defined for each; as long as the non-first-type comparison items on both sides fall within their predefined valid ranges, the non-first-type comparison item in the first comparison object and the corresponding non-first-type comparison item in the second comparison object are considered to satisfy the preset condition.
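A minimal Python sketch of this range-based alternative follows. The function name `in_valid_ranges` and the representation of a valid range as a closed `(low, high)` pair are assumptions for illustration.

```python
def in_valid_ranges(a, b, range_a, range_b):
    """Return True if a lies in range_a and b lies in range_b (closed intervals)."""
    low_a, high_a = range_a
    low_b, high_b = range_b
    return low_a <= a <= high_a and low_b <= b <= high_b


# Both sides must fall inside their predefined valid range.
print(in_valid_ranges(99.5, 100.2, (99.0, 101.0), (99.0, 101.0)))
print(in_valid_ranges(98.0, 100.2, (99.0, 101.0), (99.0, 101.0)))
```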
In an alternative of this embodiment, step 103 specifically includes:
extracting the first-type comparison item of each comparison object as the key value of that comparison object;
grouping the comparison objects in the first data set by key value, and grouping the comparison objects in the second data set by key value, where the comparison objects in the same group have the same key value;
comparing the comparison objects in the first data set and the second data set group by group, where the comparison objects in the two groups being compared have the same key value.
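Key extraction and grouping can be sketched as below, assuming each comparison object is a tuple and the positions of the first-type comparison items are known. The function name `group_by_key` and the sample rows are hypothetical and are not taken from the application's tables.

```python
from collections import defaultdict


def group_by_key(rows, key_columns):
    """Group rows by the tuple of their first-type (key) columns.

    Returns a dict mapping each key to the list of remaining
    (non-first-type) values of the rows that share that key.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        values = tuple(v for i, v in enumerate(row) if i not in key_columns)
        groups[key].append(values)
    return dict(groups)


# Hypothetical rows: (city, province, month, sales volume).
rows = [
    ("Hangzhou", "Zhejiang", 1, 100.5),
    ("Hangzhou", "Zhejiang", 1, 200.25),
    ("Nanjing", "Jiangsu", 1, 50.0),
]
groups = group_by_key(rows, key_columns=(0, 1, 2))
print(groups)
```

Applying the same function to both data sets yields two dictionaries whose entries can then be compared key by key.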
In an alternative of this embodiment, comparing the comparison objects in the first data set and the second data set group by group includes:
obtaining a first group from the first data set and a second group from the second data set according to the grouping, where the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group;
comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group;
if every non-first-type comparison item in the first group and the corresponding non-first-type comparison item in the second group satisfy the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, the difference between the two groups is reported.
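The group-level check can be sketched as follows. The sketch assumes the two groups already share a key value, that their rows are in corresponding order, and that the preset condition is an absolute-difference threshold; the function name `groups_consistent` and the default threshold are illustrative choices.

```python
def groups_consistent(group_a, group_b, threshold=1e-6):
    """Compare two groups of non-first-type values that share the same key.

    Each group is a list of tuples of floats. The groups are consistent
    when they contain the same number of rows and every pair of
    corresponding values differs by less than the threshold.
    """
    if len(group_a) != len(group_b):
        return False  # differing object counts: report a difference
    for row_a, row_b in zip(group_a, group_b):
        for x, y in zip(row_a, row_b):
            if abs(x - y) >= threshold:
                return False
    return True


print(groups_consistent([(1.0,), (2.0,)], [(1.0000001,), (2.0,)]))
print(groups_consistent([(1.0,), (2.0,)], [(1.0,)]))
```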
In an alternative of this embodiment, before the non-first-type comparison items of each comparison object in the first group are compared with the corresponding non-first-type comparison items in the second group, the method may further include:
sorting the comparison items other than the key value of the comparison objects in the first group, and sorting the comparison items other than the key value of the comparison objects in the second group.
For example, the first group includes comparison objects Ai, i = 1...4, and the second group includes comparison objects Bj, j = 1...4. Ai includes comparison items ai1, ai2, ai3, ai4; Bj includes comparison items bj1, bj2, bj3, bj4. ai1, ai2, bj1, bj2 are first-type comparison items, and ai3, ai4, bj3, bj4 are second-type comparison items. The combination of ai1 and ai2 serves as the key value and is identical to the key value formed by bj1 and bj2.
When the first group is compared with the second group, it is first determined that the two groups contain the same number of comparison objects, namely 4, so only ai3, ai4 need to be compared with bj3, bj4. To compare ai3 with bj3, the ai3 values are first sorted by numerical value and the bj3 values are sorted by numerical value, and the sorted ai3 and bj3 are then compared. ai4 and bj4 are compared in the same way as ai3 and bj3.
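The sort-then-compare step for one column, as in the ai3 and bj3 example, can be sketched like this. The function name `column_matches`, the absolute-difference threshold, and the sample values are assumptions made for illustration.

```python
def column_matches(col_a, col_b, threshold=1e-6):
    """Sort both columns by numerical value, then compare them pairwise."""
    if len(col_a) != len(col_b):
        return False
    return all(abs(x - y) < threshold
               for x, y in zip(sorted(col_a), sorted(col_b)))


# Hypothetical ai3 and bj3 values of a four-object group, in arbitrary order.
a3 = [3.0, 1.0, 4.0, 2.0]
b3 = [1.0000001, 2.0, 3.0, 4.0]
print(column_matches(a3, b3))
```

Sorting removes any dependence on the order in which rows arrive, which matters when the two data sets come from different platforms.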
In an alternative of this embodiment, the comparison of the first data set and the second data set is implemented in a cloud computing manner. Specifically, before the comparison objects in the first data set and the second data set are compared, the comparison objects in the first data set and the comparison objects in the second data set are distributed to instances according to the key value, based on a distributed computing framework, and the actual comparison of comparison objects is carried out in each instance. The specific distribution scheme can be set as needed; for example, comparison objects with the same key value are distributed to the same instance. The comparison objects in one group can be distributed to one instance or to multiple instances, but the comparison objects in an instance can be made to belong to the same group. Of course, an instance can also hold multiple groups; for example, if some group contains few comparison objects, then after that group is distributed to an instance, some of the comparison objects of other groups can also be distributed to that instance. The application does not limit this.
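One possible distribution scheme, in which comparison objects with the same key value always land in the same instance, is hash partitioning. The single-process Python sketch below only simulates this; the function names, the use of CRC32 as the hash, and the sample rows are assumptions, and a real distributed computing framework would perform the partitioning itself.

```python
import zlib


def instance_for(key, num_instances):
    """Map a key to an instance index with a deterministic hash (CRC32)."""
    digest = zlib.crc32(repr(key).encode("utf-8"))
    return digest % num_instances


def distribute(rows, key_columns, num_instances):
    """Assign every row to the instance chosen by its key value."""
    instances = [[] for _ in range(num_instances)]
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        instances[instance_for(key, num_instances)].append(row)
    return instances


# Hypothetical rows: (city, province, month, sales volume).
rows = [
    ("Hangzhou", "Zhejiang", 1, 100.5),
    ("Nanjing", "Jiangsu", 1, 50.0),
    ("Hangzhou", "Zhejiang", 1, 200.25),
]
instances = distribute(rows, key_columns=(0, 1, 2), num_instances=3)
```

Because the same function is applied to both data sets, the groups that need to be compared with each other end up in the same instance.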
The application is further illustrated below by an application example.
In this example, when data sets are compared, the data sets are distributed to different INSTANCEs (instances) using the mapreduce framework (a distributed computing framework) for distributed verification.
The schema corresponding to each data set is determined (a schema describes a data set); that is, the data is treated as a two-dimensional table and the type of each column is determined. In this example, the column types are character string, integer, and floating-point number.
Character string and integer columns are first-type comparison items (or the "exact match" type); that is, if two rows of data are equal, their character string and integer columns must match exactly. Floating-point columns are second-type comparison items, which can also be considered equal if the error is within a certain range.
For example, in the two rows of data in Table 1 below, the character string and integer columns match exactly and the difference between the floating-point numbers satisfies the preset condition, so the two rows are considered equal. In practice a threshold can be specified, for example (f1-f2)/f1 < 0.000001, where f1 is the floating-point number in data set 1 and f2 is the floating-point number in data set 2.
Table 1
             Character string   Integer   Floating-point number
Data set 1   Hangzhou           100       99.99999999
Data set 2   Hangzhou           100       100.0000000
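The two rows of Table 1 can be checked with a short Python sketch. The function name `rows_equal` is hypothetical, and taking the absolute value of the relative error and guarding against a zero denominator are assumptions added here; the source states the threshold test only as (f1-f2)/f1 < 0.000001.

```python
def rows_equal(row1, row2, rel_threshold=0.000001):
    """Exact match on strings and integers, relative tolerance on floats."""
    if len(row1) != len(row2):
        return False
    for v1, v2 in zip(row1, row2):
        if isinstance(v1, float) and isinstance(v2, float):
            if v1 == 0:
                # Assumption: a zero reference value only matches zero.
                if v2 != 0:
                    return False
            elif abs(v1 - v2) / abs(v1) >= rel_threshold:
                return False
        elif v1 != v2:
            return False
    return True


row1 = ("Hangzhou", 100, 99.99999999)  # data set 1 in Table 1
row2 = ("Hangzhou", 100, 100.0000000)  # data set 2 in Table 1
print(rows_equal(row1, row2))
```

Here the relative error is about 1e-10, well below the 0.000001 threshold, so the two rows are treated as equal.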
The character string and integer columns of the two data sets are identified and used as the key (key value) when distributing the data, and the floating-point number is used as the value to be compared. If the groups with the same key contain the same number of rows and the floating-point differences are within the acceptable threshold, the data sets are considered equal; otherwise the difference is reported. If a group with the same key has multiple floating-point values, they are compared after sorting. Take the comparison of the following two data sets as an example. Table 2 is data set 1 and Table 3 is data set 2, and the two need to be compared.
Table 2
Table 3
Here the city, province, and month columns of the two tables serve as the grouping key, and the sales volume serves as the value to be compared.
Data with the same key are distributed to an INSTANCE on the same machine for comparison.
According to the key, Tables 2 and 3 can be logically divided into three groups, group1, group2, and group3, which are compared in different INSTANCEs; these INSTANCEs are distributed across different machines in the cluster. In Table 4 below, key1 and key2 correspond to the key columns of Table 2 and Table 3, respectively.
Table 4
As can be seen from Table 4, group1 has two rows to compare: the second row is exactly equal, and the floating-point difference of the first row is within the acceptable range, so it is also considered equal. The data in group2 match exactly. In group3, there are two rows from Table 2 but only one row from Table 3, so it can be concluded that in this group Table 3 lacks one row compared with Table 2.
Combining the comparison results of all groups, the conclusion is that the two tables differ, and the specific difference is that Table 3 lacks one row in the group <NANJING, JIANGSU, 1>. This specific information also makes it easier to trace the root cause of the difference in the business.
Because a distributed comparison method is used, large data volumes can be compared conveniently; two tables of 10,000,000,000 rows were compared in testing.
In the above application example, if there are multiple floating-point columns, they can be compared column by column. An advantage of distributed computing, however, is that the floating-point fields can be split apart and each combined with the key field into its own group, so that the groups are compared in parallel and the comparison results are then merged. For example, given four fields A, B, C, D, where A is the key field and B, C, D are floating-point fields, AB, AC, and AD can be compared simultaneously and the results then merged.
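The AB + AC + AD split can be imitated on a single machine with a thread pool, as a rough stand-in for parallel instances in a distributed framework. Everything in the sketch below (function names, the absolute-difference threshold, the sample rows) is an assumption for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def compare_column(rows_1, rows_2, key_col, value_col, threshold=1e-6):
    """Compare one floating-point column of two data sets, grouped by key."""
    def project(rows):
        grouped = {}
        for row in rows:
            grouped.setdefault(row[key_col], []).append(row[value_col])
        return {k: sorted(v) for k, v in grouped.items()}

    g1, g2 = project(rows_1), project(rows_2)
    if g1.keys() != g2.keys():
        return False
    return all(
        len(g1[k]) == len(g2[k])
        and all(abs(x - y) < threshold for x, y in zip(g1[k], g2[k]))
        for k in g1
    )


def compare_all_columns(rows_1, rows_2, key_col, value_cols):
    """Run one comparison per value column in parallel, then merge results."""
    with ThreadPoolExecutor() as pool:
        futures = {c: pool.submit(compare_column, rows_1, rows_2, key_col, c)
                   for c in value_cols}
        return {c: f.result() for c, f in futures.items()}


# Hypothetical rows with fields (A, B, C, D); A is the key field.
rows_1 = [("k1", 1.0, 2.0, 3.0), ("k2", 4.0, 5.0, 6.0)]
rows_2 = [("k1", 1.0, 2.0, 3.5), ("k2", 4.0, 5.0, 6.0)]
result = compare_all_columns(rows_1, rows_2, key_col=0, value_cols=[1, 2, 3])
tables_equal = all(result.values())
print(result, tables_equal)
```

The per-column result also shows which field caused a mismatch, here column D.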
In addition, when there are multiple floating-point columns, the comparison process is to sort first and compare afterwards. The sorting proceeds field by field, increasing the sort depth with each field. For example, with fields A, B, C, D, where A is the primary key, the sort on B is completed first, then the sort on C within the B ordering, and then the sort on D. The comparison is carried out after sorting is complete.
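In Python, this field-by-field ordering (B first, then C within equal B, then D) corresponds to sorting with a tuple key. The sample rows below are hypothetical and share one primary key value.

```python
# Hypothetical rows (A, B, C, D) that share the same primary key A.
rows = [
    ("k", 2.0, 1.0, 5.0),
    ("k", 1.0, 3.0, 0.0),
    ("k", 1.0, 2.0, 9.0),
    ("k", 1.0, 2.0, 4.0),
]

# Tuples compare element by element, so this sorts by B, then C, then D.
ordered = sorted(rows, key=lambda r: (r[1], r[2], r[3]))
for row in ordered:
    print(row)
```

Sorting both groups this way puts their rows in a comparable order before the pairwise comparison.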
Embodiment two
This embodiment provides a data comparison device which, as shown in Fig. 2, includes:
a data set configuration module 201, configured to determine a first data set and a second data set to be compared, where each comparison object in the first data set and the second data set includes one or more comparison items, and to determine the type of each comparison item, the types including at least a first-type comparison item and a non-first-type comparison item;
a comparison module 202, configured to compare the comparison objects in the first data set with the comparison objects in the second data set, where, if the first-type comparison item of a first comparison object in the first data set is identical to the corresponding first-type comparison item of a second comparison object in the second data set, and the difference between the non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object satisfies a preset condition, the first comparison object and the second comparison object are judged to be consistent.
In an alternative of this embodiment, the comparison module 202 includes:
a key value extraction unit 2021, configured to extract the first-type comparison item of each comparison object as the key value of that comparison object;
a grouping unit 2022, configured to group the comparison objects in the first data set by key value and to group the comparison objects in the second data set by key value, where the comparison objects in the same group have the same key value;
a comparing unit 2023, configured to compare the comparison objects in the first data set and the second data set group by group, where the comparison objects in the two groups being compared have the same key value.
In a kind of alternative of the present embodiment, the comparing unit 2023 according to packet will first data set with Comparison other in second data set be compared including:First group is obtained from first data set according to the packet, Second group is obtained from second data set, and in the key assignments of each comparison other in described first group and described second group The key assignments of each comparison other is identical;By the non-first kind of each comparison other in described first group compare item and it is described second group in The corresponding non-first kind compares item and is compared;If all non-first kind in described first group compare in item and described second group The non-first kind of correspondence compare item and meet the preparatory condition, and described first group and it is described second group in comparison other number Mesh is identical, then described first group it is consistent with described second group, otherwise, report two groups of differences.
In an alternative of this embodiment, the comparing unit 2023 is further configured to, before comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group, sort the comparison items other than the key value of each comparison object in the first group, and sort the comparison items other than the key value of each comparison object in the second group.
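The group comparison with prior sorting can be sketched as below. This is one possible reading, stated as an assumption: the objects in each group are ordered by their non-key (floating-point) items so that corresponding objects line up, the object counts are checked, and each aligned pair of items is compared against an absolute threshold. The names `non_key_items` and `groups_consistent` and the default threshold are hypothetical.

```python
def non_key_items(obj):
    """Return the non-first-type (floating-point) items of a comparison object."""
    return tuple(v for v in obj if isinstance(v, float))

def groups_consistent(group1, group2, threshold=1e-6):
    """Return True if two groups with the same key value are judged consistent."""
    # The groups must contain the same number of comparison objects.
    if len(group1) != len(group2):
        return False
    # Sort by non-key items so that corresponding objects are aligned.
    rows1 = sorted(non_key_items(o) for o in group1)
    rows2 = sorted(non_key_items(o) for o in group2)
    for r1, r2 in zip(rows1, rows2):
        if len(r1) != len(r2):
            return False
        # Every pair of corresponding items must differ by less than the threshold.
        if any(abs(a - b) >= threshold for a, b in zip(r1, r2)):
            return False
    return True
```

Sorting before a tolerance-based comparison is a simplification: values that lie within the threshold of several candidates could in principle be paired differently, which this sketch does not attempt to resolve.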
In an alternative of this embodiment, the comparing module 202 further includes a distribution unit 2024, configured to distribute the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.
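Distribution by key value can be illustrated with a simple hash partitioning sketch. The application does not name a particular framework or partitioning function, so everything here is an assumption: instances are modelled as list indices, the key value is hashed with CRC32 (chosen because Python's built-in string hash varies between processes), and `instance_for` and `dispatch` are hypothetical names.

```python
import zlib

def instance_for(key, num_instances):
    """Map a key value to an instance index using a process-independent hash."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_instances

def dispatch(dataset, key_of, num_instances):
    """Distribute comparison objects so that equal key values reach the same instance."""
    buckets = [[] for _ in range(num_instances)]
    for obj in dataset:
        buckets[instance_for(key_of(obj), num_instances)].append(obj)
    return buckets

# Hypothetical usage: the key value is the first two fields of each object.
first = [("A", 1, 0.5), ("B", 2, 1.0), ("A", 1, 0.7)]
second = [("A", 1, 0.5), ("B", 2, 1.0)]
buckets_first = dispatch(first, lambda o: o[:2], 4)
buckets_second = dispatch(second, lambda o: o[:2], 4)
```

Because both data sets use the same mapping, the groups with a given key value from the first and second data sets arrive at the same instance and can be compared there without further data movement.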
In an alternative of this embodiment, the comparing module 202 judges in the following manner whether the difference between a non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object meets the preset condition:
For any non-first-type comparison item of the first comparison object, obtain the difference between it and the corresponding non-first-type comparison item of the second comparison object, or obtain the ratio of that difference to the non-first-type comparison item currently being compared; judge whether the difference or the ratio is less than a preset threshold; if so, the non-first-type comparison items currently being compared of the first comparison object and the second comparison object meet the preset condition.
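The threshold judgment can be sketched as a single function covering both the difference and the ratio variants. The sketch makes assumptions that the text leaves open: the difference is taken as an absolute value, the ratio is taken relative to the item of the first comparison object, and a zero reference value only matches an identical value. The name `item_within_tolerance` and its parameters are hypothetical.

```python
def item_within_tolerance(a, b, threshold, relative=False):
    """Judge whether two corresponding non-first-type items meet the preset condition.

    a: item of the first comparison object
    b: corresponding item of the second comparison object
    relative: if True, compare the ratio diff / |a| with the threshold
    """
    diff = abs(a - b)
    if not relative:
        return diff < threshold
    if a == 0:
        # Assumption: with a zero reference value, only an exact match is accepted.
        return diff == 0
    return diff / abs(a) < threshold
```

With a threshold close to zero the comparison is effectively exact, while a larger threshold gives a fuzzy comparison of floating-point items.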
In an alternative of this embodiment, the first-type comparison items include comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-type comparison items include comparison items whose data type is floating-point number.
The application has the following advantages:
1. Based on a cloud computing distributed platform, efficient and rapid data comparison can be achieved in a big data context.
2. The comparison precision can be adjusted flexibly (exact or fuzzy).
Of course, a product implementing the application does not necessarily need to achieve all of the above advantages at the same time.
One of ordinary skill in the art will appreciate that all or some of the steps of the above method may be completed by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc. Alternatively, all or some of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of a software functional module. The application is not limited to any particular combination of hardware and software.

Claims (10)

  1. A data comparison method, characterized by comprising:
    determining a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set includes one or more comparison items;
    determining the type of each comparison item, the types comprising at least: first-type comparison items and non-first-type comparison items;
    comparing the comparison objects in the first data set with the comparison objects in the second data set, wherein, if the first-type comparison items of a first comparison object in the first data set are identical to the corresponding first-type comparison items of a second comparison object in the second data set, and the difference between the non-first-type comparison items of the first comparison object and the corresponding non-first-type comparison items of the second comparison object meets a preset condition, the first comparison object is judged to be consistent with the second comparison object; the first-type comparison items include comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-type comparison items include comparison items whose data type is floating-point number;
    wherein comparing the comparison objects in the first data set with the comparison objects in the second data set comprises:
    extracting the first-type comparison items of each comparison object as the key value of that comparison object;
    grouping the comparison objects in the first data set by key value, and grouping the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;
    comparing the comparison objects in the first data set and the second data set group by group, wherein the comparison objects in the two groups being compared with each other have the same key value.
  2. The method according to claim 1, characterized in that comparing the comparison objects in the first data set and the second data set group by group comprises:
    obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group;
    comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group;
    if all non-first-type comparison items in the first group and the corresponding non-first-type comparison items in the second group meet the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, reporting a difference between the two groups.
  3. The method according to claim 2, characterized in that,
    before comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group, the method further comprises:
    sorting the comparison items other than the key value of each comparison object in the first group, and sorting the comparison items other than the key value of each comparison object in the second group.
  4. The method according to any one of claims 1 to 3, characterized in that, before the comparison objects in the first data set and the second data set are compared, the method further comprises: distributing the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.
  5. The method according to any one of claims 1 to 3, characterized in that the difference between the non-first-type comparison items of the first comparison object and the corresponding non-first-type comparison items of the second comparison object meeting the preset condition comprises:
    for any non-first-type comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-type comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-type comparison item currently being compared;
    judging whether the difference or the ratio is less than a preset threshold, and if so, the non-first-type comparison items currently being compared of the first comparison object and the second comparison object meet the preset condition.
  6. A data comparison device, characterized by comprising:
    a data set configuration module, configured to determine a first data set and a second data set to be compared, wherein each comparison object in the first data set and the second data set includes one or more comparison items, and to determine the type of each comparison item, the types comprising at least: first-type comparison items and non-first-type comparison items;
    a comparing module, configured to compare the comparison objects in the first data set with the comparison objects in the second data set, wherein, if the first-type comparison items of a first comparison object in the first data set are identical to the corresponding first-type comparison items of a second comparison object in the second data set, and the difference between the non-first-type comparison items of the first comparison object and the corresponding non-first-type comparison items of the second comparison object meets a preset condition, the first comparison object is judged to be consistent with the second comparison object; the first-type comparison items include comparison items whose data type is character string and/or comparison items whose data type is integer; the non-first-type comparison items include comparison items whose data type is floating-point number;
    wherein the comparing module includes:
    a key value extraction unit, configured to extract the first-type comparison items of each comparison object as the key value of that comparison object;
    a grouping unit, configured to group the comparison objects in the first data set by key value and to group the comparison objects in the second data set by key value, wherein the comparison objects in the same group have the same key value;
    a comparing unit, configured to compare the comparison objects in the first data set and the second data set group by group, wherein the comparison objects in the two groups being compared with each other have the same key value.
  7. The device according to claim 6, characterized in that the comparing unit comparing the comparison objects in the first data set and the second data set group by group comprises: obtaining a first group from the first data set and a second group from the second data set according to the grouping, wherein the key value of each comparison object in the first group is identical to the key value of each comparison object in the second group; comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group; if all non-first-type comparison items in the first group and the corresponding non-first-type comparison items in the second group meet the preset condition, and the first group and the second group contain the same number of comparison objects, the first group is consistent with the second group; otherwise, reporting a difference between the two groups.
  8. The device according to claim 7, characterized in that the comparing unit is further configured to, before comparing the non-first-type comparison items of each comparison object in the first group with the corresponding non-first-type comparison items in the second group, sort the comparison items other than the key value of each comparison object in the first group, and sort the comparison items other than the key value of each comparison object in the second group.
  9. The device according to any one of claims 6 to 8, characterized in that the comparing module further includes a distribution module, configured to distribute the comparison objects in the first data set and the comparison objects in the second data set to instances according to the key value, based on a distributed computing framework.
  10. The device according to any one of claims 6 to 8, characterized in that the comparing module judges in the following manner whether the difference between a non-first-type comparison item of the first comparison object and the corresponding non-first-type comparison item of the second comparison object meets the preset condition:
    for any non-first-type comparison item of the first comparison object, obtaining the difference between it and the corresponding non-first-type comparison item of the second comparison object, or obtaining the ratio of that difference to the non-first-type comparison item currently being compared; judging whether the difference or the ratio is less than a preset threshold, and if so, the non-first-type comparison items currently being compared of the first comparison object and the second comparison object meet the preset condition.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310224623.2A CN104239301B (en) 2013-06-06 2013-06-06 A kind of data comparison method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310224623.2A CN104239301B (en) 2013-06-06 2013-06-06 A kind of data comparison method and device

Publications (2)

Publication Number Publication Date
CN104239301A CN104239301A (en) 2014-12-24
CN104239301B true CN104239301B (en) 2018-02-13

Family

ID=52227395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310224623.2A Active CN104239301B (en) 2013-06-06 2013-06-06 A kind of data comparison method and device

Country Status (1)

Country Link
CN (1) CN104239301B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778179B (en) * 2014-01-14 2019-05-28 阿里巴巴集团控股有限公司 A kind of Data Migration test method and system
CN105677645B (en) * 2014-11-17 2018-12-21 阿里巴巴集团控股有限公司 A kind of tables of data comparison method and device
CN104572921B (en) * 2014-12-27 2017-12-19 北京奇虎科技有限公司 A kind of method of data synchronization and device across data center
CN105988889B (en) * 2015-02-11 2019-06-14 阿里巴巴集团控股有限公司 A kind of data verification method and device
CN105989089A (en) * 2015-02-12 2016-10-05 阿里巴巴集团控股有限公司 Data comparison method and device
CN106202134A (en) * 2015-05-30 2016-12-07 中国石油化工股份有限公司 Data redundancy inspection method
CN107291672B (en) * 2016-03-31 2020-11-20 阿里巴巴集团控股有限公司 Data table processing method and device
CN106372668A (en) * 2016-08-31 2017-02-01 新浪网技术(中国)有限公司 Data matching method and device
CN108228560A (en) * 2016-12-22 2018-06-29 北京国双科技有限公司 A kind of determining method and device of data type
CN108399151B (en) * 2017-02-06 2022-02-15 百度在线网络技术(北京)有限公司 Data comparison system and method
CN108681559B (en) * 2018-04-11 2020-09-25 广东电网有限责任公司 Comparison method and system based on multi-system data application
CN109783697B (en) * 2018-12-14 2021-04-27 北京海数宝科技有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111563073B (en) * 2020-04-20 2023-07-07 杭州市质量技术监督检测院 NQI information sharing method, platform, server and readable storage medium
CN112711683A (en) * 2021-02-25 2021-04-27 浙江口碑网络技术有限公司 Data comparison method and device and computer equipment
CN112905602B (en) * 2021-03-26 2022-09-30 掌阅科技股份有限公司 Data comparison method, computing device and computer storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1472917A (en) * 2002-07-30 2004-02-04 上海阿尔卡特网络支援系统有限公司 Program controlled switcher database corresponding system
CN101887436A (en) * 2009-05-12 2010-11-17 阿里巴巴集团控股有限公司 Retrieval method, device and system
CN102411588A (en) * 2010-09-26 2012-04-11 金蝶软件(中国)有限公司 Comparison checking method and system of data table
CN102831127A (en) * 2011-06-17 2012-12-19 阿里巴巴集团控股有限公司 Method, device and system for processing repeating data
CN102880650A (en) * 2012-08-27 2013-01-16 中国工商银行股份有限公司 Data matching method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8504517B2 (en) * 2010-03-29 2013-08-06 Commvault Systems, Inc. Systems and methods for selective data replication
US9443033B2 (en) * 2012-05-05 2016-09-13 Blackbaud, Inc. Systems, methods, and computer program products for data integration and data mapping

Also Published As

Publication number Publication date
CN104239301A (en) 2014-12-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191210

Address after: P.O. Box 31119, grand exhibition hall, hibiscus street, 802 West Bay Road, Grand Cayman, Cayman Islands

Patentee after: Innovative advanced technology Co., Ltd

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Patentee before: Alibaba Group Holding Co., Ltd.
