CN108399151B - Data comparison system and method - Google Patents

Publication number
CN108399151B
Authority
CN
China
Prior art keywords
data
key
value
value pairs
key value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710065045.0A
Other languages
Chinese (zh)
Other versions
CN108399151A (en)
Inventor
米博会
魏庆滨
张磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Original Assignee
Baidu Online Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baidu Online Network Technology Beijing Co Ltd
Priority to CN201710065045.0A
Publication of CN108399151A
Application granted
Publication of CN108399151B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data comparison system and a data comparison method. The data comparison system comprises a mapping module, a merging module and a reduction module. The mapping module is used for respectively mapping first data and second data to be compared to obtain first key-value pairs for the multiple lines of data in the first data and second key-value pairs for the multiple lines of data in the second data; the merging module is used for respectively sorting the first key-value pairs and the second key-value pairs to obtain sorted first key-value pairs and sorted second key-value pairs, and merging the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; and the reduction module is used for judging whether the values of the key-value pairs in the merging result are the same to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result. The system and the method can avoid dependence on the order of the data within a file and effectively improve data comparison efficiency.

Description

Data comparison system and method
Technical Field
The invention relates to the technical field of computers, in particular to a data comparison system and a data comparison method.
Background
With the development of big data technology in the computer field, the consistency of the data in two files often needs to be compared during big data application and processing. In one application scenario, for example, after an item A on an existing product line is upgraded, the correctness of its execution needs to be verified. This can be done by running the original program of item A and the upgraded program of item A on the same input and determining whether their outputs are consistent.
In the related art, data comparison is performed on a single machine and requires the contents of the two input files to be in the same order. For example, file a contains three lines of data: data A, data B, and data C, while file b contains three lines of data: data B, data A, and data C. In this example, the data comparison method in the related art determines that file a and file b are inconsistent.
As a result, when big data is compared, the comparison depends heavily on the order of the data in the files, is difficult to execute on a single machine, and has low efficiency.
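The order dependence described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not taken from the patent; the second function only shows what an order-insensitive comparison would conclude for the same input, not how the invention achieves it.

```python
from collections import Counter


def positional_equal(lines_a, lines_b):
    """Order-sensitive comparison: line i of one file must equal line i of the other."""
    return len(lines_a) == len(lines_b) and all(
        x == y for x, y in zip(lines_a, lines_b)
    )


def order_independent_equal(lines_a, lines_b):
    """Order-insensitive comparison: the same lines with the same multiplicities."""
    return Counter(lines_a) == Counter(lines_b)


file_a = ["data A", "data B", "data C"]
file_b = ["data B", "data A", "data C"]

print(positional_equal(file_a, file_b))         # False
print(order_independent_equal(file_a, file_b))  # True
```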
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a data comparison system, which can avoid the dependence on the data sequence in the file, and effectively improve the data comparison efficiency.
Another objective of the present invention is to provide a data comparison method.
Another objective of the present invention is to provide a data comparison apparatus.
It is another object of the invention to propose a non-transitory computer-readable storage medium.
It is a further object of the invention to propose a computer program product.
To achieve the above objects, an embodiment of a first aspect of the invention provides a data comparison system, which includes: a mapping module, configured to respectively map first data and second data to be compared to obtain first key-value pairs for the multiple lines of data in the first data and second key-value pairs for the multiple lines of data in the second data; a merging module, configured to sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; and a reduction module, configured to judge whether the values of the key-value pairs in the merging result are the same to obtain a judgment result, and compare the first data and the second data to be compared according to the judgment result.
The data comparison system provided in the embodiment of the first aspect of the present invention maps the first data and the second data to be compared to obtain first key-value pairs for the multiple lines of data in the first data and second key-value pairs for the multiple lines of data in the second data; sorts the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs; merges the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; judges whether the values of the key-value pairs in the merging result are the same to obtain a judgment result; and compares the first data and the second data to be compared according to the judgment result. This avoids dependence on the order of the data within a file and effectively improves data comparison efficiency.
In order to achieve the above object, an embodiment of a data comparison method according to a second aspect of the present invention includes: mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data; respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result; and judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result.
According to the data comparison method provided by the embodiment of the second aspect of the invention, mapping processing is respectively carried out on the first data and the second data to be compared to obtain the first key value pairs of the multiple lines of data in the first data and the second key value pairs of the multiple lines of data in the second data, the first key value pairs and the second key value pairs are respectively sequenced to obtain the sequenced first key value pairs and the sequenced second key value pairs, the sequenced first key value pairs and the sequenced second key value pairs are combined to obtain a combined result, whether the value values of the key value pairs in the combined result are the same or not is judged to obtain a judgment result, the first data and the second data to be compared are compared according to the judgment result, dependence on the data sequence in a file can be avoided, and data comparison efficiency is effectively improved.
To achieve the above object, a data comparing device according to a third aspect of the present invention includes: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to: mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data; respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result; and judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result.
The data comparison device provided in the embodiment of the third aspect of the present invention obtains the first key value pairs of the multiple lines of data in the first data and the second key value pairs of the multiple lines of data in the second data by mapping the first data and the second data to be compared, sorts the first key value pairs and the second key value pairs respectively to obtain the sorted first key value pairs and the sorted second key value pairs, merges the sorted first key value pairs and the sorted second key value pairs to obtain a merged result, determines whether the value values of the key value pairs in the merged result are the same, obtains a determination result, compares the first data and the second data to be compared according to the determination result, can avoid dependence on the data sequence in the file, and effectively improves data comparison efficiency.
To achieve the above object, a non-transitory computer-readable storage medium according to a fourth aspect of the present invention is provided, where instructions of the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to execute a data comparison method, where the method includes: mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data; respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result; and judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result.
The non-transitory computer-readable storage medium provided in the fourth embodiment of the present invention obtains first key-value pairs of multiple rows of data in first data and second key-value pairs of multiple rows of data in second data by mapping the first data and the second data to be compared, and sorts the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merges the sorted first key-value pairs and the sorted second key-value pairs to obtain a merged result, determines whether value values of the key-value pairs in the merged result are the same to obtain a determination result, and compares the first data and the second data to be compared according to the determination result, so as to avoid dependency on a data sequence in a file and effectively improve data comparison efficiency.
To achieve the above object, a computer program product according to a fifth embodiment of the present invention is provided, in which when instructions are executed by a processor, a data comparison method is performed, and the method includes: mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data; respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result; and judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result.
In the computer program product provided in the embodiment of the fifth aspect of the present invention, mapping processing is performed on first data and second data to be compared respectively, so as to obtain first key-value pairs of multiple lines of data in the first data and second key-value pairs of multiple lines of data in the second data, sort the first key-value pairs and the second key-value pairs respectively, obtain sorted first key-value pairs and sorted second key-value pairs, merge the sorted first key-value pairs and the sorted second key-value pairs, obtain a merged result, determine whether value values of the key-value pairs in the merged result are the same, obtain a determination result, compare the first data and the second data to be compared according to the determination result, avoid dependence on a data sequence in a file, and effectively improve data comparison efficiency.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic structural diagram of a data comparison system according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a data comparison system according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of data flow in a data comparison system according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a data comparison method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a data comparison method according to another embodiment of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Fig. 1 is a schematic structural diagram of a data comparison system according to an embodiment of the present invention.
With the development of big data technology in the computer field, the consistency of the data in two files often needs to be compared during big data application and processing. In one application scenario, for example, after an item A on an existing product line is upgraded, the correctness of its execution needs to be verified. This can be done by running the original program of item A and the upgraded program of item A on the same input and determining whether their outputs are consistent.
Referring to FIG. 1, the data comparison system includes: a mapping module 100, configured to respectively map first data and second data to be compared to obtain first key-value pairs for the multiple lines of data in the first data and second key-value pairs for the multiple lines of data in the second data; a merging module 200, configured to sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result; and a reduction module 300, configured to judge whether the values of the key-value pairs in the merging result are the same to obtain a judgment result, and compare the first data and the second data to be compared according to the judgment result.
In an embodiment of the present invention, the data comparison system includes: the mapping module 100, configured to respectively map the first data and the second data to be compared to obtain first key-value pairs for the multiple lines of data in the first data and second key-value pairs for the multiple lines of data in the second data.
In the related art, data comparison is performed on a single machine and requires the contents of the two input files to be in the same order. As a result, when big data is compared, the comparison depends heavily on the order of the data in the files, is difficult to execute on a single machine, and has low efficiency.
In the embodiment of the invention, the comparison processing of the big data file can be realized based on the Map-Reduce programming model, and the data comparison efficiency can be effectively improved.
The Map-Reduce programming model supports parallel operations on large-scale data sets. The Map-Reduce programming model may be deployed on the Hadoop Distributed File System (HDFS).
It can be understood that, in an application scenario, the data to be compared may include multiple lines of data. Therefore, following the execution specification of the Map-Reduce programming model, an embodiment of the present invention configures a mapping module in the data comparison system. By executing a pre-written Map function of the Map-Reduce programming model, the mapping module maps the first data and the second data to be compared to obtain first key-value pairs for the multiple lines of data in the first data and second key-value pairs for the multiple lines of data in the second data. Because the Map function can be executed in parallel, parallel mapping of big data can be achieved.
The first data and the second data to be compared may be stored in different storage paths of the HDFS in a file form in advance, for example, the first data may be stored in the first storage path in the form of a first file, and the second data may be stored in the second storage path in the form of a second file.
Optionally, in some embodiments, referring to FIG. 2, the data comparison system further comprises: a reading module 400, configured to read the first data and the second data from their storage paths.
The data to be compared is stored in advance as files under different storage paths in the HDFS. When the data is compared, the reading module 400 of the data comparison system reads the first data and the second data from those storage paths. The HDFS is highly fault-tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to stored data, which makes it suitable for comparing large-scale data. In addition, data in the HDFS can be accessed as a stream, which effectively supports parallel comparison of big data.
Optionally, in some embodiments, referring to fig. 2, the mapping module 100 includes:
The first mapping sub-module 110 is configured to calculate each line of data in the multiple lines of data of the first data by using a preset encryption algorithm to obtain a first encrypted value corresponding to each line of data, and use the corresponding first encrypted value as the key of the first key-value pair; and to calculate each line of data in the multiple lines of data of the second data by using the preset encryption algorithm to obtain a second encrypted value corresponding to each line of data, and use the corresponding second encrypted value as the key of the second key-value pair.
Optionally, the preset encryption algorithm is the Message Digest Algorithm 5 (MD5).
As an example, refer to FIG. 3, which is a schematic diagram of data flow in the data comparison system in an embodiment of the present invention. The diagram includes: the first storage path dir1 (31), the second storage path dir2 (32), the first key-value pairs 33, the second key-value pairs 34, and the merging results 35. The first storage path dir1 stores the first data, whose lines are: aaaaaa, bbbbbb, cccccc. The second storage path dir2 stores the second data, whose lines are: bbbbbb, cccccc, dddddd. The mapping module 100 in the data comparison system may read the first data and the second data from the first storage path dir1 and the second storage path dir2 respectively, and calculate each line of the first data by using the preset encryption algorithm to obtain the first encrypted value corresponding to each line. For example, applying the MD5 algorithm to the lines aaaaaa, bbbbbb, cccccc of the first data yields MD5(aaaaaa) for aaaaaa, MD5(bbbbbb) for bbbbbb, and MD5(cccccc) for cccccc, and each first encrypted value is used as the key of the corresponding first key-value pair. Similarly, each line of the second data is calculated by using the preset encryption algorithm to obtain the second encrypted value corresponding to each line. For example, applying the MD5 algorithm to the lines bbbbbb, cccccc, dddddd of the second data yields MD5(bbbbbb) for bbbbbb, MD5(cccccc) for cccccc, and MD5(dddddd) for dddddd, and each second encrypted value is used as the key of the corresponding second key-value pair.
The second mapping sub-module 120 is configured to generate the value of the first key-value pair according to the storage path of the first data and the data content of each line of data in the multiple lines of data of the first data, and to generate the value of the second key-value pair according to the storage path of the second data and the data content of each line of data in the multiple lines of data of the second data.
As an example, referring to FIG. 3, the first storage path dir1 stores the first data, whose lines are: aaaaaa, bbbbbb, cccccc. The value of each first key-value pair may be generated according to the storage path of the first data and the data content of the corresponding line. That is, the value of the first key-value pair corresponding to aaaaaa is <1, aaaaaa>, the value corresponding to bbbbbb is <1, bbbbbb>, and the value corresponding to cccccc is <1, cccccc>. Similarly, the second storage path dir2 stores the second data, whose lines are: bbbbbb, cccccc, dddddd. The value of each second key-value pair may be generated according to the storage path of the second data and the data content of the corresponding line. That is, the value of the second key-value pair corresponding to bbbbbb is <2, bbbbbb>, the value corresponding to cccccc is <2, cccccc>, and the value corresponding to dddddd is <2, dddddd>.
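The mapping described by the two sub-modules can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `map_line` and the integer tags 1 and 2 standing for the storage paths dir1 and dir2 are hypothetical, and the patent does not specify a text encoding (UTF-8 is assumed here).

```python
import hashlib


def map_line(source_tag, line):
    """Emit one (key, value) pair for a single line of input.

    key   : MD5 hex digest of the line content
    value : (source_tag, line), where source_tag identifies the storage path
    """
    key = hashlib.md5(line.encode("utf-8")).hexdigest()
    return key, (source_tag, line)


first_data = ["aaaaaa", "bbbbbb", "cccccc"]    # lines stored under dir1
second_data = ["bbbbbb", "cccccc", "dddddd"]   # lines stored under dir2

first_pairs = [map_line(1, line) for line in first_data]
second_pairs = [map_line(2, line) for line in second_data]

for key, value in first_pairs + second_pairs:
    print(key, value)
```

Identical lines in the two inputs receive the same key regardless of their position, which is what later allows them to be grouped together.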
Further, after the mapping module 100 generates the first key-value pairs and the second key-value pairs, they may be written to a local disk of the data comparison system.
In an embodiment of the present invention, the data comparing system further includes: the merging module 200 is configured to sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result.
Optionally, in some embodiments, referring to fig. 2, the number of the merging results is at least one, and the merging module includes:
The sorting sub-module 210 is configured to sort the key-value pairs in the first key-value pairs by key to obtain the sorted first key-value pairs, and to sort the key-value pairs in the second key-value pairs by key to obtain the sorted second key-value pairs.
In the embodiment of the present invention, the sorting sub-module 210 may call a partition function to sort the first key-value pairs by key and, similarly, to sort the second key-value pairs by key.
The merging sub-module 220 is configured to merge the key-value pairs that have the same key in the sorted first key-value pairs and the sorted second key-value pairs to obtain a plurality of merging results.
In an embodiment of the present invention, the merging module 200 may receive a request sent by the reduction module 300, for example an HTTP request, which triggers the merging sub-module 220 to merge the key-value pairs that have the same key in the sorted first key-value pairs and the sorted second key-value pairs. The sorted first key-value pairs and the sorted second key-value pairs output by the map task may be obtained from the TaskTracker on which the map task runs, and the pairs that have the same key are merged to obtain a plurality of merging results, as shown in FIG. 3. Further, the plurality of merging results are input to the reduction module 300 as the response to the HTTP request.
The first key value pairs and the second key value pairs are respectively sequenced to obtain the sequenced first key value pairs and the sequenced second key value pairs, so that the stability of large-scale data comparison operation can be effectively guaranteed. The key value pairs with the same key value in the sorted first key value pair and the sorted second key value pair are combined to obtain a plurality of combined results, dependence on the data sequence in the file can be avoided, and the comparison accuracy is effectively guaranteed.
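The sort-then-merge step can be sketched in Python as follows. This is a single-process illustration under stated assumptions, not the distributed shuffle of a Map-Reduce framework: the names `md5_key` and `merge_by_key` are hypothetical, and the standard-library `heapq.merge` and `itertools.groupby` stand in for the merging sub-module.

```python
import hashlib
from heapq import merge
from itertools import groupby
from operator import itemgetter


def md5_key(line):
    """Key of a line: its MD5 hex digest (UTF-8 encoding is an assumption)."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()


first_pairs = [(md5_key(s), (1, s)) for s in ["aaaaaa", "bbbbbb", "cccccc"]]
second_pairs = [(md5_key(s), (2, s)) for s in ["bbbbbb", "cccccc", "dddddd"]]

# Sort each set of key-value pairs by key.
sorted_first = sorted(first_pairs, key=itemgetter(0))
sorted_second = sorted(second_pairs, key=itemgetter(0))


def merge_by_key(sorted_a, sorted_b):
    """Merge two key-sorted lists and group the values that share a key."""
    merged_stream = merge(sorted_a, sorted_b, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged_stream, key=itemgetter(0))
    ]


merge_results = merge_by_key(sorted_first, sorted_second)
for key, values in merge_results:
    print(key, values)
```

With the example data, four merging results are produced: two contain one value each (aaaaaa and dddddd) and two contain a value from each input (bbbbbb and cccccc).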
In an embodiment of the present invention, the data comparing system further includes: and the reduction module 300 is configured to determine whether the value values of the key value pairs in the merging result are the same, obtain a determination result, and compare the first data and the second data to be compared according to the determination result.
Optionally, in some embodiments, referring to fig. 2, the reduction module 300 comprises:
The determining sub-module 310 is configured to determine whether the data contents in the values of the key-value pairs of each merging result are the same, to obtain a determination result corresponding to each merging result.
The comparison sub-module 320 is configured to generate no comparison result for a merging result when the corresponding determination result is that the data contents are the same, and to generate a comparison result indicating that the first data and the second data differ when the corresponding determination result is that the data contents are different, or when the merging result contains only one key-value pair.
It can be understood that, in the embodiment of the present invention, since the plurality of merging results are obtained by merging key-value pairs having the same key-value according to the sorted first key-value pairs and the sorted second key-value pairs, for each merging result, data contents in the value values of the key-value pairs in the merging result may be compared to generate a corresponding comparison result.
For example, in the reduction module 300, when the corresponding determination result is that the data contents are the same, the two lines of data in the merging result are identical, and no comparison result needs to be generated for that merging result. When the corresponding determination result is that the data contents are different, or when the merging result contains only one key-value pair, the two lines of data in the merging result differ or the data in the merging result exists in only one file, and a comparison result indicating that the first data and the second data differ may be generated.
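The decision rule of the reduction module can be sketched in Python as follows. The function name `reduce_group`, the shape of the difference record, and the placeholder keys `k1` to `k3` are hypothetical. The sketch follows the rule as stated and does not attempt to handle lines that are duplicated within a single file.

```python
def reduce_group(key, values):
    """Return None when a merging result matches, otherwise a difference record.

    values: list of (source_tag, content) pairs that share the same key.
    """
    if len(values) == 1:
        # Only one key-value pair: the line exists in only one of the files.
        source_tag, content = values[0]
        return {"key": key, "reason": "only_in_one_file",
                "source": source_tag, "content": content}
    contents = {content for _, content in values}
    if len(contents) > 1:
        # Same key but different data contents.
        return {"key": key, "reason": "content_differs", "values": values}
    return None  # data contents are the same: no comparison result is generated


merged = [
    ("k1", [(1, "aaaaaa")]),
    ("k2", [(1, "bbbbbb"), (2, "bbbbbb")]),
    ("k3", [(2, "dddddd")]),
]
differences = [r for r in (reduce_group(k, v) for k, v in merged) if r is not None]
for record in differences:
    print(record)
```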
Optionally, in some embodiments, referring to FIG. 2, the data comparison system further comprises:
The display module 500, configured to display the storage path and the comparison result when the first data and the second data differ.
By displaying the storage path and the comparison result when the first data and the second data are different, the testing personnel can timely know the comparison result, and the user experience is improved.
In this embodiment, mapping processing is performed on first data and second data to be compared respectively to obtain first key value pairs of multiple lines of data in the first data and second key value pairs of multiple lines of data in the second data, the first key value pairs and the second key value pairs are sorted respectively to obtain sorted first key value pairs and sorted second key value pairs, the sorted first key value pairs and the sorted second key value pairs are merged to obtain a merged result, whether value values of the key value pairs in the merged result are the same or not is judged to obtain a judgment result, the first data and the second data to be compared are compared according to the judgment result, dependence on data sequences in files can be avoided, and data comparison efficiency is improved effectively.
Fig. 4 is a flowchart illustrating a data comparison method according to an embodiment of the invention.
Referring to Fig. 4, the data comparison method includes:
S41: Map the first data and the second data to be compared, respectively, to obtain first key-value pairs for the rows of the first data and second key-value pairs for the rows of the second data.
S42: Sort the first key-value pairs and the second key-value pairs separately to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result.
S43: Judge whether the values of the key-value pairs in the merging result are the same to obtain a judgment result, and compare the first data and the second data according to the judgment result.
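As a rough illustration of steps S41 to S43, the following Python sketch maps each row to a key-value pair, sorts both sides by key, and merges equal keys with a two-pointer walk. The function names (`map_lines`, `sort_and_merge`, `compare`), the use of an MD5 digest as the key, and the in-memory lists standing in for the two data files are all assumptions for the example, not details fixed by this embodiment.

```python
import hashlib
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (key, value)

def map_lines(lines: Iterable[str]) -> List[Pair]:
    # S41 (assumed form): key = digest of the row, value = the row itself.
    return [(hashlib.md5(line.encode("utf-8")).hexdigest(), line) for line in lines]

def sort_and_merge(first: List[Pair], second: List[Pair]):
    # S42: sort each side by key, then merge pairs that share a key.
    a, b = sorted(first), sorted(second)
    i = j = 0
    merged = []
    while i < len(a) or j < len(b):
        # Pick the smallest key not yet consumed on either side.
        if j >= len(b) or (i < len(a) and a[i][0] < b[j][0]):
            key = a[i][0]
        else:
            key = b[j][0]
        left, right = [], []
        while i < len(a) and a[i][0] == key:
            left.append(a[i][1])
            i += 1
        while j < len(b) and b[j][0] == key:
            right.append(b[j][1])
            j += 1
        merged.append((key, left, right))
    return merged

def compare(first_lines: Iterable[str], second_lines: Iterable[str]):
    # S43: keep only the merging results whose two sides do not match.
    pairs_a, pairs_b = map_lines(first_lines), map_lines(second_lines)
    return [(l, r) for _, l, r in sort_and_merge(pairs_a, pairs_b) if l != r]
```

Because both sides are ordered by key before merging, two inputs that contain the same rows in a different order produce no differences, which is the order independence the method aims for.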
It should be noted that the explanation of the embodiment of the data comparison system in fig. 1 is also applicable to the data comparison method of the embodiment, and the implementation principle is similar, and is not repeated here.
In this embodiment, the first data and the second data to be compared are each mapped to obtain first key-value pairs for the rows of the first data and second key-value pairs for the rows of the second data. The first key-value pairs and the second key-value pairs are sorted separately, and the sorted first key-value pairs and sorted second key-value pairs are merged to obtain a merging result. Whether the values of the key-value pairs in the merging result are the same is then judged, and the first data and the second data are compared according to the judgment result. This avoids dependence on the order of data within the files and effectively improves data comparison efficiency.
Fig. 5 is a flowchart illustrating a data comparison method according to another embodiment of the invention.
Referring to Fig. 5, the data comparison method includes:
S51: Read the first data and the second data from the storage path.
S52: Apply a preset encryption algorithm to each row of the first data to obtain a first encryption value for that row, and use the first encryption value as the key of the corresponding first key-value pair; apply the same preset encryption algorithm to each row of the second data to obtain a second encryption value for that row, and use the second encryption value as the key of the corresponding second key-value pair.
In an embodiment of the present invention, the preset encryption algorithm is Message-Digest Algorithm 5 (MD5).
S53: Generate the value of each first key-value pair from the storage path of the first data and the data content of the corresponding row of the first data, and generate the value of each second key-value pair from the storage path of the second data and the data content of the corresponding row of the second data.
S54: Sort the first key-value pairs by key to obtain sorted first key-value pairs, and sort the second key-value pairs by key to obtain sorted second key-value pairs.
S55: Merge the key-value pairs that have the same key in the sorted first key-value pairs and the sorted second key-value pairs to obtain a plurality of merging results.
S56: For each merging result, judge whether the data contents in the values of its key-value pairs are the same, and obtain a judgment result corresponding to that merging result.
S57: When the judgment result is that the data contents are the same, do not generate a comparison result for the merging result; when the judgment result is that the data contents are different, or when the merging result contains only one key-value pair, generate a comparison result indicating that the first data and the second data differ.
S58: When the first data and the second data differ, display the storage path and the comparison result.
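Steps S52 to S57 can be sketched in Python as below. This is a minimal sketch under stated assumptions: the names `to_pairs` and `diff_report` are hypothetical, the rows are passed in as in-memory lists rather than read from HDFS (S51 is not modeled), and a value is represented as a `(storage_path, row_content)` tuple. Only the standard-library calls `hashlib.md5`, `heapq.merge`, and `itertools.groupby` are used.

```python
import hashlib
import heapq
from itertools import groupby
from operator import itemgetter

def to_pairs(path, rows):
    # S52 + S53: key = MD5 hex digest of the row,
    # value = (storage path, row content). The tuple layout is an assumption.
    return [(hashlib.md5(row.encode("utf-8")).hexdigest(), (path, row)) for row in rows]

def diff_report(first_path, first_rows, second_path, second_rows):
    # S54: sort each set of key-value pairs by key.
    first = sorted(to_pairs(first_path, first_rows), key=itemgetter(0))
    second = sorted(to_pairs(second_path, second_rows), key=itemgetter(0))
    # S55: merge the two sorted sequences so that equal keys become adjacent.
    merged = heapq.merge(first, second, key=itemgetter(0))
    report = []
    for _key, group in groupby(merged, key=itemgetter(0)):
        values = [value for _, value in group]
        contents = {row for _, row in values}          # S56: compare contents
        if len(values) == 1 or len(contents) > 1:      # S57: difference found
            report.extend(values)
    return report  # S58 would display these (path, row) entries
```

For example, `diff_report("/data/a", ["x", "y"], "/data/b", ["y", "z"])` reports the row `x` under `/data/a` and the row `z` under `/data/b`. Note that, following the rule as stated in S57, a row repeated twice within a single file forms a group of two pairs with identical content and is therefore not reported; handling that case would need an extra check that this sketch does not attempt.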
It should be noted that the explanation of the embodiment of the data comparison system in the foregoing fig. 1-3 is also applicable to the data comparison method of the embodiment, and the implementation principle is similar, and is not repeated here.
In this embodiment, the data to be compared are stored in advance as files under different storage paths in HDFS, and during comparison the first data and the second data are read from those storage paths. HDFS is highly fault tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to stored data, which makes it suitable for comparing large-scale data. In addition, HDFS provides streaming access to the stored data, which effectively supports parallel comparison of big data. Sorting the first key-value pairs and the second key-value pairs separately to obtain the sorted first key-value pairs and the sorted second key-value pairs effectively ensures the stability of large-scale data comparison. Merging the key-value pairs that have the same key in the sorted first key-value pairs and the sorted second key-value pairs to obtain a plurality of merging results avoids dependence on the order of data within the files and effectively ensures comparison accuracy. Displaying the storage path and the comparison result when the first data and the second data differ lets testers learn of the comparison result promptly and improves the user experience.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A data comparison system, comprising:
the mapping module is used for respectively mapping the first data and the second data to be compared to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data;
a merging module, configured to sort the first key-value pairs and the second key-value pairs respectively to obtain sorted first key-value pairs and sorted second key-value pairs, and merge the sorted first key-value pairs and the sorted second key-value pairs to obtain a merging result;
the reduction module is used for judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data to be compared with the second data according to the judgment result;
wherein, the number of the merging results is at least one, and the merging module comprises:
the sorting sub-module is used for sorting a plurality of key value pairs in the first key value pair according to key values to obtain a sorted first key value pair, and sorting a plurality of key value pairs in the second key value pair according to key values to obtain a sorted second key value pair;
the merging submodule is used for merging the key value pairs with the same key value in the sorted first key value pair and the sorted second key value pair to obtain a plurality of merging results;
wherein the mapping module comprises:
the first mapping sub-module is used for calculating each row of data in the multiple rows of data of the first data by adopting a preset encryption algorithm to obtain a first encryption value corresponding to each row of data, taking the corresponding first encryption value as a key value in a first key value pair, calculating each row of data in the multiple rows of data of the second data by adopting the preset encryption algorithm to obtain a second encryption value corresponding to each row of data, and taking the corresponding second encryption value as a key value in a second key value pair;
and the second mapping sub-module is used for generating a value in the first key value pair according to the storage path of the first data in the distributed file system HDFS and the data content of each line of data in the plurality of lines of data of the first data, and generating a value in the second key value pair according to the storage path of the second data in the distributed file system HDFS and the data content of each line of data in the plurality of lines of data of the second data.
2. The data comparison system of claim 1, wherein the reduction module comprises:
the judgment submodule is used for judging whether the data contents in the value values of the key value pairs of each merging result are the same or not to obtain a judgment result corresponding to each merging result;
and the comparison submodule is used for not generating a comparison result for the merging result when the corresponding judgment result is that the data contents are the same, and for generating a comparison result indicating that the first data and the second data differ when the corresponding judgment result is that the data contents are different, or when the number of key-value pairs in the merging result is one.
3. The data comparison system of claim 2, further comprising:
and the display module is used for displaying the storage path and the comparison result when the first data and the second data differ.
4. The data comparison system of claim 1, further comprising:
a reading module, configured to read the first data and the second data from the storage path.
5. The data comparison system of claim 1, wherein the preset encryption algorithm is a message digest algorithm.
6. A data comparison method, comprising:
mapping first data and second data to be compared respectively to obtain a first key value pair of a plurality of rows of data in the first data and a second key value pair of a plurality of rows of data in the second data;
respectively sequencing the first key-value pairs and the second key-value pairs to obtain sequenced first key-value pairs and sequenced second key-value pairs, and combining the sequenced first key-value pairs and the sequenced second key-value pairs to obtain a combined result;
judging whether the value values of the key value pairs in the merging result are the same or not to obtain a judgment result, and comparing the first data and the second data to be compared according to the judgment result;
wherein the number of the merging results is at least one, the first key-value pairs and the second key-value pairs are respectively sorted to obtain sorted first key-value pairs and sorted second key-value pairs, and the sorted first key-value pairs and the sorted second key-value pairs are merged to obtain merging results, including:
sorting a plurality of key value pairs in the first key value pair according to key values to obtain a sorted first key value pair, and sorting a plurality of key value pairs in the second key value pair according to key values to obtain a sorted second key value pair;
merging key value pairs with the same key value in the sorted first key value pair and the sorted second key value pair to obtain a plurality of merging results;
the method further comprises the following steps:
calculating each line of data in the multiple lines of data of the first data by adopting a preset encryption algorithm to obtain a first encryption value corresponding to each line of data, taking the corresponding first encryption value as a key value in a first key value pair, calculating each line of data in the multiple lines of data of the second data by adopting the preset encryption algorithm to obtain a second encryption value corresponding to each line of data, and taking the corresponding second encryption value as a key value in a second key value pair;
and generating a value in the first key value pair according to a storage path of the first data in the distributed file system HDFS and data content of each line of data in the plurality of lines of data of the first data, and generating a value in the second key value pair according to a storage path of the second data in the distributed file system HDFS and data content of each line of data in the plurality of lines of data of the second data.
7. The data comparison method according to claim 6, wherein the determining whether the value values of the key value pairs in the merged result are the same to obtain a determination result, and comparing the first data and the second data to be compared according to the determination result comprises:
judging whether the data content in the value of each key value pair is the same or not according to each merging result to obtain a judgment result corresponding to each merging result;
and when the corresponding judgment result is that the data contents are the same, not generating a comparison result for the merging result, and when the corresponding judgment result is that the data contents are different, or when the number of key-value pairs in the merging result is one, generating a comparison result indicating that the first data and the second data differ.
8. The data comparison method of claim 7, further comprising:
and when the first data and the second data differ, displaying the storage path and the comparison result.
9. The data comparison method of claim 6, further comprising:
reading the first data and the second data from the storage path.
10. The data comparison method of claim 6, wherein the preset encryption algorithm is a message digest algorithm.
CN201710065045.0A 2017-02-06 2017-02-06 Data comparison system and method Active CN108399151B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710065045.0A CN108399151B (en) 2017-02-06 2017-02-06 Data comparison system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710065045.0A CN108399151B (en) 2017-02-06 2017-02-06 Data comparison system and method

Publications (2)

Publication Number Publication Date
CN108399151A CN108399151A (en) 2018-08-14
CN108399151B true CN108399151B (en) 2022-02-15

Family

ID=63093510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710065045.0A Active CN108399151B (en) 2017-02-06 2017-02-06 Data comparison system and method

Country Status (1)

Country Link
CN (1) CN108399151B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388371B (en) * 2018-09-26 2021-01-26 中兴飞流信息科技有限公司 Data sorting method, system, co-processing device and main processing device
CN111241144B (en) * 2018-11-28 2024-01-26 阿里巴巴集团控股有限公司 Data processing method and system
CN110413960B (en) * 2019-06-19 2023-03-28 平安银行股份有限公司 File comparison method and device, computer equipment and computer readable storage medium
CN111061720B (en) * 2020-03-12 2021-05-07 支付宝(杭州)信息技术有限公司 Data screening method and device and electronic equipment
CN111528798A (en) * 2020-04-27 2020-08-14 湖北中医药高等专科学校 Olfactory detection system and method for medical ophthalmology and otorhinolaryngology
CN111581942B (en) * 2020-06-12 2023-06-27 上海通联金融服务有限公司 Data file comparison method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794223A (en) * 2010-02-03 2010-08-04 南京联创科技集团股份有限公司 Design method of WADE service message architecture
CN104239301A (en) * 2013-06-06 2014-12-24 阿里巴巴集团控股有限公司 Data comparing method and device
CN104252486A (en) * 2013-06-28 2014-12-31 阿里巴巴集团控股有限公司 Data processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102724219B (en) * 2011-03-29 2015-06-03 国际商业机器公司 A network data computer processing method and a system thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant