CN108399151B

CN108399151B - Data comparison system and method

Info

Publication number: CN108399151B
Application number: CN201710065045.0A
Authority: CN
Inventors: 米博会; 魏庆滨; 张磊
Original assignee: Baidu Online Network Technology Beijing Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd
Priority date: 2017-02-06
Filing date: 2017-02-06
Publication date: 2022-02-15
Anticipated expiration: 2037-02-06
Also published as: CN108399151A

Abstract

The invention provides a data comparison system and a data comparison method. The data comparison system comprises a mapping module, a merging module and a reduction module. The mapping module is used for respectively mapping first data and second data to be compared to obtain first key-value pairs of a plurality of rows of data in the first data and second key-value pairs of a plurality of rows of data in the second data; the merging module is used for respectively sorting the first key-value pairs and the second key-value pairs to obtain sorted first key-value pairs and sorted second key-value pairs, and merging the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; and the reduction module is used for judging whether the values of the key-value pairs in the merging result are the same to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result. The system and the method can avoid dependence on the order of the data in the files and effectively improve data comparison efficiency.

Description

Data comparison system and method

Technical Field

The invention relates to the technical field of computers, in particular to a data comparison system and a data comparison method.

Background

With the development of big data technology in the computer field, big data applications and processing often need to check whether the data in two files are consistent. In one application scenario, a project A on an existing product line must be verified for correct execution after an upgrade. This can be done by running the original program of project A and the upgraded program of project A on the same input and determining whether their outputs are consistent.

In the related art, data comparison is performed on a single machine and requires that the contents of the two input files be in the same order. For example, file a contains three rows of data: data A, data B and data C, while file b contains three rows of data: data B, data A and data C. In this example, the data comparison method in the related art determines that file a and file b are inconsistent.

As a result, when big data is compared, the comparison depends heavily on the order of the data in the files, is difficult to execute on a single machine, and is inefficient.
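The order dependence described above can be illustrated with a minimal Python sketch. The function name `positional_compare` is hypothetical and only stands in for the related-art behaviour of comparing row i of one file against row i of the other; it is not taken from any cited implementation.

```python
def positional_compare(rows_a, rows_b):
    """Related-art style check: row i of file a must equal row i of file b."""
    return len(rows_a) == len(rows_b) and all(
        a == b for a, b in zip(rows_a, rows_b)
    )

file_a = ["data A", "data B", "data C"]
file_b = ["data B", "data A", "data C"]

# Same rows, different order: the positional check reports a mismatch.
print(positional_compare(file_a, file_b))  # False
# An order-independent view of the same rows shows they are identical.
print(sorted(file_a) == sorted(file_b))    # True
```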

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, an object of the present invention is to provide a data comparison system, which can avoid the dependence on the data sequence in the file, and effectively improve the data comparison efficiency.

Another objective of the present invention is to provide a data comparison method.

Another objective of the present invention is to provide a data comparison apparatus.

It is another object of the invention to propose a non-transitory computer-readable storage medium.

It is a further object of the invention to propose a computer program product.

To achieve the above object, an embodiment of the first aspect of the invention provides a data comparison system, which includes: a mapping module, configured to respectively map the first data and the second data to be compared to obtain first key-value pairs of a plurality of rows of data in the first data and second key-value pairs of a plurality of rows of data in the second data; a merging module, configured to sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; and a reduction module, configured to determine whether the values of the key-value pairs in the merging result are the same to obtain a determination result, and compare the first data and the second data to be compared according to the determination result.

The data comparison system provided in the embodiment of the first aspect of the present invention maps the first data and the second data to be compared respectively to obtain first key-value pairs of multiple rows of data in the first data and second key-value pairs of multiple rows of data in the second data; sorts the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs; merges the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; determines whether the values of the key-value pairs in the merging result are the same to obtain a determination result; and compares the first data and the second data according to the determination result. This avoids dependence on the order of the data in the files and effectively improves data comparison efficiency.

In order to achieve the above object, an embodiment of a data comparison method according to a second aspect of the present invention includes: mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data; respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result; and judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result.

According to the data comparison method provided by the embodiment of the second aspect of the invention, mapping processing is respectively carried out on the first data and the second data to be compared to obtain the first key value pairs of the multiple lines of data in the first data and the second key value pairs of the multiple lines of data in the second data, the first key value pairs and the second key value pairs are respectively sequenced to obtain the sequenced first key value pairs and the sequenced second key value pairs, the sequenced first key value pairs and the sequenced second key value pairs are combined to obtain a combined result, whether the value values of the key value pairs in the combined result are the same or not is judged to obtain a judgment result, the first data and the second data to be compared are compared according to the judgment result, dependence on the data sequence in a file can be avoided, and data comparison efficiency is effectively improved.

To achieve the above object, a data comparing device according to a third aspect of the present invention includes: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to: mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data; respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result; and judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result.

The data comparison device provided in the embodiment of the third aspect of the present invention obtains the first key value pairs of the multiple lines of data in the first data and the second key value pairs of the multiple lines of data in the second data by mapping the first data and the second data to be compared, sorts the first key value pairs and the second key value pairs respectively to obtain the sorted first key value pairs and the sorted second key value pairs, merges the sorted first key value pairs and the sorted second key value pairs to obtain a merged result, determines whether the value values of the key value pairs in the merged result are the same, obtains a determination result, compares the first data and the second data to be compared according to the determination result, can avoid dependence on the data sequence in the file, and effectively improves data comparison efficiency.

To achieve the above object, a non-transitory computer-readable storage medium according to a fourth aspect of the present invention is provided, where instructions of the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to execute a data comparison method, where the method includes: mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data; respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result; and judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result.

The non-transitory computer-readable storage medium provided in the fourth embodiment of the present invention obtains first key-value pairs of multiple rows of data in first data and second key-value pairs of multiple rows of data in second data by mapping the first data and the second data to be compared, and sorts the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merges the sorted first key-value pairs and the sorted second key-value pairs to obtain a merged result, determines whether value values of the key-value pairs in the merged result are the same to obtain a determination result, and compares the first data and the second data to be compared according to the determination result, so as to avoid dependency on a data sequence in a file and effectively improve data comparison efficiency.

To achieve the above object, a computer program product according to a fifth embodiment of the present invention is provided, in which when instructions are executed by a processor, a data comparison method is performed, and the method includes: mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data; respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result; and judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result.

The computer program product provided in the embodiment of the fifth aspect of the present invention maps the first data and the second data to be compared respectively to obtain first key-value pairs of multiple rows of data in the first data and second key-value pairs of multiple rows of data in the second data; sorts the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs; merges the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; determines whether the values of the key-value pairs in the merging result are the same to obtain a determination result; and compares the first data and the second data according to the determination result. This avoids dependence on the order of the data in the files and effectively improves data comparison efficiency.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic structural diagram of a data comparison system according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a data comparison system according to another embodiment of the present invention;

FIG. 3 is a schematic diagram of data flow in a data comparison system according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a data comparison method according to an embodiment of the present invention;

FIG. 5 is a flowchart illustrating a data comparison method according to another embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.

Fig. 1 is a schematic structural diagram of a data comparison system according to an embodiment of the present invention.

Referring to FIG. 1, the data comparison system includes: a mapping module 100, configured to map the first data and the second data to be compared respectively to obtain first key-value pairs of multiple rows of data in the first data and second key-value pairs of multiple rows of data in the second data; a merging module 200, configured to sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; and a reduction module 300, configured to determine whether the values of the key-value pairs in the merging result are the same to obtain a determination result, and compare the first data and the second data to be compared according to the determination result.

In an embodiment of the present invention, the data comparison system includes the mapping module 100, which is configured to map the first data and the second data to be compared respectively to obtain first key-value pairs of multiple rows of data in the first data and second key-value pairs of multiple rows of data in the second data.

In the related technology, the data comparison method is based on single machine comparison processing, and the contents of two input files required to be compared are ordered, so that in the process of comparing big data, the dependence of data sequence in the files is high, the single machine execution is difficult, and the comparison efficiency is low.

In the embodiment of the invention, the comparison processing of the big data file can be realized based on the Map-Reduce programming model, and the data comparison efficiency can be effectively improved.

The Map-Reduce programming model can perform parallel operations on large-scale data sets. The Map-Reduce programming model may be deployed on the Hadoop Distributed File System (HDFS).

It can be understood that, in an application scenario, the data to be compared may include multiple rows of data. Therefore, in accordance with the execution specification of the Map-Reduce programming model, a mapping module is configured in the data comparison system in an embodiment of the present invention. By executing a pre-written Map function of the Map-Reduce programming model, the mapping module maps the first data and the second data to be compared to obtain first key-value pairs of the multiple rows of data in the first data and second key-value pairs of the multiple rows of data in the second data. The Map function may be executed in parallel, so that parallel mapping of big data can be implemented.
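As a rough illustration of how a Map function can run in parallel over splits of the input, the following Python sketch uses a thread pool in place of a real Map-Reduce cluster. The names `map_chunk`, `parallel_map`, `key_fn`, `chunk_size` and `workers` are assumptions introduced for this sketch, and the identity key used in the demo is only a placeholder for the key calculation described later.

```python
from concurrent.futures import ThreadPoolExecutor

def map_chunk(chunk, source_tag, key_fn):
    """Map one input split: emit (key, (source_tag, row)) for every row."""
    return [(key_fn(row), (source_tag, row)) for row in chunk]

def parallel_map(rows, source_tag, key_fn, chunk_size=2, workers=4):
    """Split the rows into chunks and map the chunks concurrently."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the chunk order, so the output order is deterministic.
        results = list(pool.map(lambda c: map_chunk(c, source_tag, key_fn), chunks))
    return [pair for chunk_pairs in results for pair in chunk_pairs]

# Placeholder key: the row itself (hypothetical, for illustration only).
pairs = parallel_map(["aaaaaa", "bbbbbb", "cccccc"], 1, key_fn=lambda row: row)
print(pairs)
```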

The first data and the second data to be compared may be stored in different storage paths of the HDFS in a file form in advance, for example, the first data may be stored in the first storage path in the form of a first file, and the second data may be stored in the second storage path in the form of a second file.

Optionally, in some embodiments, referring to FIG. 2, the data comparison system further includes a reading module 400, which is configured to read the first data and the second data from the storage paths.

The data to be compared are stored in advance as files in different storage paths of the HDFS. When the data are compared, the reading module 400 of the data comparison system reads the first data and the second data from these storage paths. The HDFS is highly fault tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to the stored data, which makes it suitable for comparing large-scale data. In addition, the HDFS provides streaming access to the data it stores, which effectively supports parallel comparison of big data.
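The reading step can be sketched as follows. This is a minimal sketch that uses the local filesystem as a stand-in for HDFS; the function name `read_rows` and the file name `part-0` are assumptions for illustration, not part of the described system.

```python
import tempfile
from pathlib import Path

def read_rows(storage_path):
    """Stream every row of every file under storage_path, one row at a time."""
    for file_path in sorted(Path(storage_path).iterdir()):
        if file_path.is_file():
            with file_path.open(encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")

# Demo: two temporary directories play the role of the two storage paths.
with tempfile.TemporaryDirectory() as root:
    dir1 = Path(root, "dir1")
    dir2 = Path(root, "dir2")
    dir1.mkdir()
    dir2.mkdir()
    (dir1 / "part-0").write_text("aaaaaa\nbbbbbb\ncccccc\n", encoding="utf-8")
    (dir2 / "part-0").write_text("bbbbbb\ncccccc\ndddddd\n", encoding="utf-8")
    first_data = list(read_rows(dir1))
    second_data = list(read_rows(dir2))

print(first_data)
print(second_data)
```

Because `read_rows` is a generator, rows are consumed one at a time rather than loaded as a whole, which loosely mirrors the streaming access mentioned above.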

Optionally, in some embodiments, referring to fig. 2, the mapping module 100 includes:

the first mapping sub-module 110 is configured to calculate each line of data in the multiple lines of data of the first data by using a preset encryption algorithm to obtain a first encrypted value corresponding to each line of data, use the corresponding first encrypted value as a key value in the first key value pair, calculate each line of data in the multiple lines of data of the second data by using the preset encryption algorithm to obtain a second encrypted value corresponding to each line of data, and use the corresponding second encrypted value as a key value in the second key value pair.

Optionally, the preset encryption algorithm is Message-Digest Algorithm 5 (MD5).

As an example, refer to FIG. 3, which is a schematic diagram of data flow in the data comparison system according to an embodiment of the present invention. The diagram includes a first storage path dir1 (31), a second storage path dir2 (32), first key-value pairs 33, second key-value pairs 34, and merging results 35. The first storage path dir1 stores the first data, whose rows are aaaaaa, bbbbbb and cccccc. The second storage path dir2 stores the second data, whose rows are bbbbbb, cccccc and dddddd. The mapping module 100 in the data comparison system may read the first data and the second data from the first storage path dir1 and the second storage path dir2 respectively. It then calculates each row of the first data with the preset encryption algorithm to obtain the first encrypted value corresponding to that row. For example, applying the MD5 algorithm to the rows aaaaaa, bbbbbb and cccccc yields MD5(aaaaaa), MD5(bbbbbb) and MD5(cccccc), and each first encrypted value is used as the key value of the corresponding first key-value pair. Similarly, each row of the second data is calculated with the preset encryption algorithm to obtain the second encrypted value corresponding to that row. For example, applying the MD5 algorithm to the rows bbbbbb, cccccc and dddddd yields MD5(bbbbbb), MD5(cccccc) and MD5(dddddd), and each second encrypted value is used as the key value of the corresponding second key-value pair.
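The key calculation in this example can be sketched with Python's standard `hashlib` module. The helper name `row_key` and the choice of UTF-8 encoding and hexadecimal output are assumptions made for this sketch; the description only states that MD5 is applied to each row.

```python
import hashlib

def row_key(row):
    """Return the MD5 digest of one row as a 32-character hex string."""
    return hashlib.md5(row.encode("utf-8")).hexdigest()

first_rows = ["aaaaaa", "bbbbbb", "cccccc"]    # rows stored under dir1
second_rows = ["bbbbbb", "cccccc", "dddddd"]   # rows stored under dir2

first_keys = [row_key(row) for row in first_rows]
second_keys = [row_key(row) for row in second_rows]

# Identical rows always produce identical keys, regardless of position.
shared_keys = set(first_keys) & set(second_keys)
print(len(shared_keys))  # 2
```

Since the key depends only on the row content, the rows bbbbbb and cccccc receive the same keys in both data sets even though they sit at different positions.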

And the second mapping sub-module 120 is configured to generate a value in the first key value pair according to the storage path of the first data and the data content of each line of data in the plurality of lines of data of the first data, and generate a value in the second key value pair according to the storage path of the second data and the data content of each line of data in the plurality of lines of data of the second data.

As an example, referring to FIG. 3, the first storage path dir1 stores the first data, whose rows are aaaaaa, bbbbbb and cccccc. The value in each first key-value pair may be generated according to the storage path of the first data and the data content of the corresponding row: the value corresponding to aaaaaa is <1, aaaaaa>, the value corresponding to bbbbbb is <1, bbbbbb>, and the value corresponding to cccccc is <1, cccccc>. Similarly, the second storage path dir2 stores the second data, whose rows are bbbbbb, cccccc and dddddd. The value in each second key-value pair may be generated according to the storage path of the second data and the data content of the corresponding row: the value corresponding to bbbbbb is <2, bbbbbb>, the value corresponding to cccccc is <2, cccccc>, and the value corresponding to dddddd is <2, dddddd>.

Further, after the mapping module 100 generates the first key-value pairs and the second key-value pairs, they may be written to a local disk of the data comparison system.

In an embodiment of the present invention, the data comparison system further includes the merging module 200, which is configured to sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result.
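Combining the key and the value, one possible shape for the per-row mapping is sketched below. The tags 1 and 2 follow the example values above; the dictionary `PATH_TAGS` and the function `map_row` are hypothetical names introduced for this sketch.

```python
import hashlib

# Assumed mapping from storage path to the tag used in the example values.
PATH_TAGS = {"dir1": 1, "dir2": 2}

def map_row(storage_path, row):
    """Emit one key-value pair: key = MD5(row), value = (path tag, row content)."""
    key = hashlib.md5(row.encode("utf-8")).hexdigest()
    value = (PATH_TAGS[storage_path], row)
    return key, value

first_pairs = [map_row("dir1", row) for row in ["aaaaaa", "bbbbbb", "cccccc"]]
second_pairs = [map_row("dir2", row) for row in ["bbbbbb", "cccccc", "dddddd"]]

print(first_pairs[0][1])   # (1, 'aaaaaa')
print(second_pairs[2][1])  # (2, 'dddddd')
```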

Optionally, in some embodiments, referring to fig. 2, the number of the merging results is at least one, and the merging module includes:

the sorting sub-module 210 is configured to sort the plurality of first key-value pairs by key value to obtain sorted first key-value pairs, and sort the plurality of second key-value pairs by key value to obtain sorted second key-value pairs.

In the embodiment of the present invention, the sorting sub-module 210 may call a partition function to sort the first key-value pairs by key value and, similarly, to sort the second key-value pairs by key value.

And the merging sub-module 220 is configured to merge the key-value pairs having the same key value among the sorted first key-value pairs and the sorted second key-value pairs to obtain a plurality of merging results.

In an embodiment of the present invention, the merging module 200 may receive a request sent by the reduction module 300, for example an HTTP request. The request triggers the merging sub-module 220 to merge the key-value pairs having the same key value among the sorted first key-value pairs and the sorted second key-value pairs. The merging sub-module 220 may obtain the sorted first key-value pairs and the sorted second key-value pairs output by the map task from the TaskTracker on which the map task runs, and merge them by key value to obtain a plurality of merging results, as shown in FIG. 3. The merging results are then input to the reduction module 300 as the response to the HTTP request.

The first key value pairs and the second key value pairs are respectively sequenced to obtain the sequenced first key value pairs and the sequenced second key value pairs, so that the stability of large-scale data comparison operation can be effectively guaranteed. The key value pairs with the same key value in the sorted first key value pair and the sorted second key value pair are combined to obtain a plurality of combined results, dependence on the data sequence in the file can be avoided, and the comparison accuracy is effectively guaranteed.
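The sort-then-merge behaviour can be sketched in Python with `heapq.merge` and `itertools.groupby` from the standard library. This is a single-process sketch, not the distributed shuffle of a Map-Reduce framework; the function name `sort_and_merge` and the short placeholder keys (`k_a` and so on, used instead of real MD5 values for readability) are assumptions.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def sort_and_merge(first_pairs, second_pairs):
    """Sort each side by key, merge the sorted streams, group equal keys."""
    by_key = itemgetter(0)
    sorted_first = sorted(first_pairs, key=by_key)
    sorted_second = sorted(second_pairs, key=by_key)
    merged = heapq.merge(sorted_first, sorted_second, key=by_key)
    # Each merging result is (key, [values that share this key]).
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=by_key)
    ]

first_pairs = [("k_c", (1, "cccccc")), ("k_a", (1, "aaaaaa")), ("k_b", (1, "bbbbbb"))]
second_pairs = [("k_d", (2, "dddddd")), ("k_b", (2, "bbbbbb")), ("k_c", (2, "cccccc"))]

merge_results = sort_and_merge(first_pairs, second_pairs)
for key, values in merge_results:
    print(key, values)
```

Rows present in both inputs end up in a group of two values, while rows present in only one input end up in a group of one, whatever their original position in the files.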

In an embodiment of the present invention, the data comparison system further includes the reduction module 300, which is configured to determine whether the values of the key-value pairs in the merging result are the same to obtain a determination result, and compare the first data and the second data to be compared according to the determination result.

Optionally, in some embodiments, referring to fig. 2, the reduction module 300 comprises:

the determining sub-module 310 is configured to determine whether the data contents in the values of the key-value pairs of each merging result are the same, and obtain a determination result corresponding to each merging result.

The comparison sub-module 320 is configured to generate no comparison result for a merging result when the corresponding determination result is that the data contents are the same, and to generate a comparison result indicating that the first data and the second data differ when the corresponding determination result is that the data contents are different, or when the merging result contains only one key-value pair.

It can be understood that, in the embodiment of the present invention, the plurality of merging results are obtained by merging the key-value pairs having the same key value among the sorted first key-value pairs and the sorted second key-value pairs. Therefore, for each merging result, the data contents in the values of its key-value pairs may be compared to generate a corresponding comparison result.

For example, in the reduction module 300, when the corresponding determination result is that the data contents are the same, the two rows of data in the merging result are the same, and no comparison result needs to be generated for that merging result. When the corresponding determination result is that the data contents are different, or when the merging result contains only one key-value pair, the two rows of data in the merging result differ or the data in the merging result exists in only one file, and a comparison result indicating that the first data and the second data differ may be generated.
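The judgment rule above can be sketched as a small reduce function. The function name `reduce_group`, the shape of the returned dictionary and the placeholder keys are assumptions for illustration. The sketch applies the rule literally, so a row that is repeated inside a single file but missing from the other file would form a group of identical contents and would not be reported.

```python
def reduce_group(key, values):
    """Return None when the rows match, else a record describing the difference."""
    if len(values) == 1:
        # Only one key-value pair: the row exists in just one of the files.
        return {"key": key, "reason": "only_in_one_file", "values": values}
    contents = {content for _, content in values}
    if len(contents) > 1:
        # Same key but different data contents.
        return {"key": key, "reason": "content_differs", "values": values}
    return None  # data contents are the same: no comparison result generated

merge_results = [
    ("k_a", [(1, "aaaaaa")]),
    ("k_b", [(1, "bbbbbb"), (2, "bbbbbb")]),
    ("k_c", [(1, "cccccc"), (2, "cccccc")]),
    ("k_d", [(2, "dddddd")]),
]

differences = [
    result
    for result in (reduce_group(key, values) for key, values in merge_results)
    if result is not None
]
for diff in differences:
    print(diff["reason"], diff["values"])
```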

Optionally, in some embodiments, referring to FIG. 2, the data comparison system further includes:

the display module 500 is configured to display the storage path and the comparison result when the first data and the second data have a difference.

By displaying the storage path and the comparison result when the first data and the second data are different, the testing personnel can timely know the comparison result, and the user experience is improved.

In this embodiment, mapping processing is performed on first data and second data to be compared respectively to obtain first key value pairs of multiple lines of data in the first data and second key value pairs of multiple lines of data in the second data, the first key value pairs and the second key value pairs are sorted respectively to obtain sorted first key value pairs and sorted second key value pairs, the sorted first key value pairs and the sorted second key value pairs are merged to obtain a merged result, whether value values of the key value pairs in the merged result are the same or not is judged to obtain a judgment result, the first data and the second data to be compared are compared according to the judgment result, dependence on data sequences in files can be avoided, and data comparison efficiency is improved effectively.

Fig. 4 is a flowchart illustrating a data comparison method according to an embodiment of the invention.

Referring to FIG. 4, the data comparison method includes:

S41: Map the first data and the second data to be compared respectively to obtain first key-value pairs of a plurality of rows of data in the first data and second key-value pairs of a plurality of rows of data in the second data.

S42: Sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result.

S43: Determine whether the values of the key-value pairs in the merging result are the same to obtain a determination result, and compare the first data and the second data to be compared according to the determination result.

It should be noted that the explanation of the embodiment of the data comparison system in fig. 1 is also applicable to the data comparison method of the embodiment, and the implementation principle is similar, and is not repeated here.

Referring to fig. 5, the data alignment method includes:

s51: the first data and the second data are read from the storage path.

S52: and calculating each line of data in the multiple lines of data of the first data by adopting a preset encryption algorithm to obtain a first encryption value corresponding to each line of data, taking the corresponding first encryption value as a key value in the first key value pair, calculating each line of data in the multiple lines of data of the second data by adopting the preset encryption algorithm to obtain a second encryption value corresponding to each line of data, and taking the corresponding second encryption value as a key value in the second key value pair.

In an embodiment of the present invention, the predetermined encryption Algorithm is a Message Digest Algorithm (MD 5).

S53: and generating a value in the first key value pair according to the storage path of the first data and the data content of each row of data in the plurality of rows of data of the first data, and generating a value in the second key value pair according to the storage path of the second data and the data content of each row of data in the plurality of rows of data of the second data.

S54: Sort the plurality of first key-value pairs by key to obtain sorted first key-value pairs, and sort the plurality of second key-value pairs by key to obtain sorted second key-value pairs.

S55: Merge the key-value pairs that have the same key in the sorted first key-value pairs and the sorted second key-value pairs to obtain a plurality of merging results.

S56: For each merging result, judge whether the data contents in the values of its key-value pairs are the same, to obtain a judgment result corresponding to that merging result.

S57: When the corresponding judgment result is that the data contents are the same, generate no comparison result for that merging result; when the corresponding judgment result is that the data contents are different, or when the merging result contains only one key-value pair, generate a comparison result indicating that the first data and the second data differ.

S58: When the first data and the second data differ, display the storage path and the comparison result.

It should be noted that the explanation of the data comparison system embodiments in FIGS. 1-3 also applies to the data comparison method of this embodiment; the implementation principle is similar and is not repeated here.
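Steps S52 to S58 can be sketched together as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function names and the paths `/data/old/part-0` and `/data/new/part-0` are hypothetical, the rows are passed in directly instead of being read from the HDFS, and the value is modeled as a `(storage path, row content)` tuple.

```python
import hashlib
import heapq
from itertools import groupby
from operator import itemgetter


def map_file(path, rows):
    # S52/S53: key = MD5 of the row; value = (storage path, row content).
    return [(hashlib.md5(r.encode("utf-8")).hexdigest(), (path, r)) for r in rows]


def compare(first_path, first_rows, second_path, second_rows):
    # S54: sort each side by key.
    first_sorted = sorted(map_file(first_path, first_rows), key=itemgetter(0))
    second_sorted = sorted(map_file(second_path, second_rows), key=itemgetter(0))
    # S55: merge the two sorted runs and group the pairs that share a key.
    merged = heapq.merge(first_sorted, second_sorted, key=itemgetter(0))
    results = []
    for _, group in groupby(merged, key=itemgetter(0)):
        values = [value for _, value in group]
        contents = {content for _, content in values}
        # S56/S57: identical contents produce no result; a lone pair or
        # differing contents produce a difference record.
        if len(values) == 1 or len(contents) > 1:
            results.extend(values)
    return results


# Hypothetical paths and rows, for illustration only.
report = compare(
    "/data/old/part-0", ["id=1 ok", "id=2 ok"],
    "/data/new/part-0", ["id=2 ok", "id=1 failed"],
)
# S58: show the storage path together with each differing row.
for path, content in sorted(report):
    print(f"{path}: {content}")
```

In this sketch, a row that appears twice within the same file would form a group of two identical values and be treated as matching; the steps above do not state how such duplicates should be handled.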

In this embodiment, the data to be compared is stored in advance as files under different storage paths in the HDFS, and during comparison the first data and the second data are read from those storage paths. The HDFS is highly fault tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to stored data, which makes it suitable for comparing large-scale data. In addition, data in the HDFS is accessed in a streaming manner, which effectively supports parallel comparison of big data. Sorting the first key-value pairs and the second key-value pairs respectively to obtain the sorted first key-value pairs and the sorted second key-value pairs effectively ensures the stability of large-scale data comparison operations. Merging the key-value pairs that have the same key in the sorted first key-value pairs and the sorted second key-value pairs to obtain a plurality of merging results avoids dependence on the order of the data within the files and effectively ensures comparison accuracy. Displaying the storage path and the comparison result when the first data and the second data differ allows testers to learn the comparison result promptly and improves the user experience.
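The streaming access described above can be illustrated with a generator that emits one key-value pair per line without loading the whole file into memory. This is a sketch only: `io.StringIO` stands in for a file opened from the HDFS, and the function name `stream_pairs` and the path `/hypothetical/path` are assumptions, not part of the embodiment.

```python
import hashlib
import io
from typing import Iterator, TextIO, Tuple


def stream_pairs(path: str, stream: TextIO) -> Iterator[Tuple[str, Tuple[str, str]]]:
    # Yield one (key, value) pair per line, reading the stream lazily.
    for line in stream:
        row = line.rstrip("\n")
        key = hashlib.md5(row.encode("utf-8")).hexdigest()
        yield key, (path, row)


# io.StringIO is a local stand-in for an HDFS file stream (an assumption).
fake_file = io.StringIO("row one\nrow two\n")
pairs = list(stream_pairs("/hypothetical/path", fake_file))
print(len(pairs))
```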

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A data comparison system, comprising:

the mapping module is used for respectively mapping the first data and the second data to be compared to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data;

a merging module, configured to sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result;

the reduction module is used for judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data to be compared with the second data according to the judgment result;

wherein, the number of the merging results is at least one, and the merging module comprises:

the sorting sub-module is used for sorting a plurality of key value pairs in the first key value pair according to key values to obtain a sorted first key value pair, and sorting a plurality of key value pairs in the second key value pair according to key values to obtain a sorted second key value pair;

the merging submodule is used for merging the key value pairs with the same key value in the sorted first key value pair and the sorted second key value pair to obtain a plurality of merging results;

wherein the mapping module comprises:

the first mapping sub-module is used for calculating each row of data in the multiple rows of data of the first data by adopting a preset encryption algorithm to obtain a first encryption value corresponding to each row of data, taking the corresponding first encryption value as a key value in a first key value pair, calculating each row of data in the multiple rows of data of the second data by adopting the preset encryption algorithm to obtain a second encryption value corresponding to each row of data, and taking the corresponding second encryption value as a key value in a second key value pair;

and the second mapping sub-module is used for generating a value in the first key value pair according to the storage path of the first data in the distributed file system HDFS and the data content of each line of data in the plurality of lines of data of the first data, and generating a value in the second key value pair according to the storage path of the second data in the distributed file system HDFS and the data content of each line of data in the plurality of lines of data of the second data.

2. The data comparison system of claim 1, wherein the reduction module comprises:

the judgment submodule is used for judging whether the data contents in the value values of the key value pairs of each merging result are the same or not to obtain a judgment result corresponding to each merging result;

and the comparison submodule is used for not generating the comparison result of the merged result when the corresponding judgment result is that the data contents are the same, and generating the comparison result that the first data and the second data have differences when the corresponding judgment result is that the data contents are different, or when the number of the key value pairs in the merged result is one.

3. The data comparison system of claim 2, further comprising:

and the display module is used for displaying the storage path and the comparison result when the first data and the second data have differences.

4. The data comparison system of claim 1, further comprising:

a reading module, configured to read the first data and the second data from the storage path.

5. The data comparison system of claim 1, wherein the preset encryption algorithm is a message digest algorithm.

6. A data comparison method, comprising:

mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data;

respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result;

judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result;

wherein the number of the merging results is at least one, the first key-value pairs and the second key-value pairs are respectively sorted to obtain sorted first key-value pairs and sorted second key-value pairs, and the sorted first key-value pairs and the sorted second key-value pairs are merged to obtain merging results, including:

sorting a plurality of key value pairs in the first key value pair according to key values to obtain a sorted first key value pair, and sorting a plurality of key value pairs in the second key value pair according to key values to obtain a sorted second key value pair;

merging key value pairs with the same key value in the sorted first key value pair and the sorted second key value pair to obtain a plurality of merging results;

the method further comprises the following steps:

calculating each line of data in the multiple lines of data of the first data by adopting a preset encryption algorithm to obtain a first encryption value corresponding to each line of data, taking the corresponding first encryption value as a key value in a first key value pair, calculating each line of data in the multiple lines of data of the second data by adopting the preset encryption algorithm to obtain a second encryption value corresponding to each line of data, and taking the corresponding second encryption value as a key value in a second key value pair;

and generating a value in the first key value pair according to a storage path of the first data in the distributed file system HDFS and data content of each line of data in the plurality of lines of data of the first data, and generating a value in the second key value pair according to a storage path of the second data in the distributed file system HDFS and data content of each line of data in the plurality of lines of data of the second data.

7. The data comparison method according to claim 6, wherein the determining whether the value values of the key value pairs in the merged result are the same to obtain a determination result, and comparing the first data and the second data to be compared according to the determination result comprises:

judging whether the data content in the value of each key value pair is the same or not according to each merging result to obtain a judgment result corresponding to each merging result;

and when the corresponding judgment result is that the data contents are the same, not generating a comparison result of the merged result, and when the corresponding judgment result is that the data contents are different, or when the number of key value pairs in the merged result is one, generating a comparison result that the first data and the second data have a difference.

8. The data comparison method of claim 7, further comprising:

and when the first data and the second data have differences, displaying the storage path and the comparison result.

9. The data comparison method of claim 6, further comprising:

reading the first data and the second data from the storage path.

10. The data comparison method of claim 6, wherein the preset encryption algorithm is a message digest algorithm.