CN110109920B

CN110109920B - Data comparison method and server

Info

Publication number: CN110109920B
Application number: CN201910207220.4A
Authority: CN
Inventors: 贾仁庆; 任晓静; 丁以樵
Original assignee: MIGU Culture Technology Co Ltd
Current assignee: MIGU Culture Technology Co Ltd
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2022-03-22
Anticipated expiration: 2039-03-19
Also published as: CN110109920A

Abstract

The embodiment of the invention relates to the technical field of data processing, and discloses a data comparison method and a server. The data comparison method comprises the following steps: reading two pieces of data in parallel, wherein each piece of data comprises a plurality of records and each record forms a hash node; attempting to insert each hash node into a hash tree; and, during the attempted insertion, if it is determined that a node matching the hash node exists in the hash tree, removing that node. The embodiment of the invention also provides a server. The technical scheme of the embodiment of the invention can efficiently compare large volumes of data and effectively relieves the storage-capacity bottleneck of the Hash table in existing comparison schemes.

Description

Data comparison method and server

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a data comparison method and a server.

Background

During office data synchronization, office data often becomes inconsistent between systems for various reasons. For example, content interface information office data is synchronized by a subsidiary company to an office data system, and then synchronized by the office data system to security control. Interface data is often inconsistent between the subsidiary company and the office data system, and between the office data system and security control, so it is necessary to compare whether the network element data are consistent.

At present, when comparing whether network element data are consistent, the first comparison data is first inserted into a first Hash table A. The second comparison data is then read record by record, and for each record it is judged whether the same data exists in the first Hash table A. If so, the matching data is deleted from the first Hash table A; if not, the current data is stored in a second Hash table B. After all data has been read, the first difference data remains in the first Hash table A, and the second difference data is stored in the second Hash table B.
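The prior-art procedure described above can be sketched as follows. This is a minimal illustration only: the function name `compare_prior_art` is hypothetical, records are assumed to be hashable values, and Python sets stand in for the two Hash tables.

```python
def compare_prior_art(first, second):
    """Two-table comparison as described in the background section."""
    # The whole first data set has to be loaded before comparison can start.
    table_a = set(first)
    table_b = set()
    for record in second:
        if record in table_a:
            table_a.remove(record)   # same data found: delete it from table A
        else:
            table_b.add(record)      # not found: keep it in table B
    # table_a holds the first difference data, table_b the second.
    return table_a, table_b


only_in_first, only_in_second = compare_prior_art(
    ["r1", "r2", "r3"], ["r2", "r3", "r4"]
)
print(only_in_first, only_in_second)  # {'r1'} {'r4'}
```

The sketch makes the limitation visible: `table_a` must hold the complete first data set before a single record of the second data set is examined.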

The inventor finds that at least the following problems exist in the prior art: before the comparison starts, the first comparison data must be stored in a Hash table in its entirety; when the volume of the first comparison data is very large, the limited capacity of the Hash table becomes a bottleneck, and the processing speed drops sharply because of severe collisions.

Disclosure of Invention

The embodiment of the invention aims to provide a data comparison method and a server, which can efficiently process comparison of large data volume and effectively solve the bottleneck problem of the storage capacity of a Hash table in a comparison scheme.

In order to solve the above technical problem, an embodiment of the present invention provides a data comparison method, including: reading two pieces of data in parallel, wherein each piece of data comprises a plurality of records and each record forms a hash node; attempting to insert each hash node into a hash tree; and, during the attempted insertion, if it is determined that a node matching the hash node exists in the hash tree, removing that node.

An embodiment of the present invention further provides a server, including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the data comparison method.

The embodiment of the invention also provides a computer readable storage medium, which stores a computer program, and the computer program is executed by a processor to realize the data comparison method.

Compared with the prior art, the embodiment of the invention reads two pieces of data in parallel, and removes a node when it is determined that the node matching the hash node exists in the hash tree. Because the two pieces of data are read in parallel, the capacity bottleneck that arises in the prior art from having to hold one complete piece of data can be avoided. Because the nodes matching the hash nodes are removed, the records that are not needed in the data comparison are discarded and only the needed records are retained, which further saves storage space. Therefore, the embodiment of the invention can efficiently compare large volumes of data and effectively relieves the storage-capacity bottleneck of the Hash table in existing comparison schemes.

In addition, removing the node comprises: if the node is a child node, selecting a leaf node of the node to replace the node and deleting the selected leaf node; and if the node is a leaf node, deleting the node. This embodiment provides a specific way of removing nodes.

In addition, the node matching the hash node means that the node and the hash node have the same keyword and different data sources. This embodiment provides a specific way of matching.

In addition, after determining that the node matching the hash node exists in the hash tree, the method further includes: judging whether the record contents of the node and the hash node are the same; if the record contents are different, outputting and storing the node and the hash node, and then entering the step of removing the node; and if the record contents are the same, directly entering the step of removing the node. In this embodiment, records with different data sources, the same keyword, and different record contents can also be identified, that is, more kinds of difference data can be identified.

In addition, the hash tree is a prime number based hash tree. Implementing the data comparison method on a prime number based hash tree provides larger capacity and higher comparison efficiency.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements and which are not to scale unless otherwise specified.

FIG. 1 is a flow chart of a data comparison method according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of hash node insertion according to a first embodiment of the present invention;

FIG. 3 is a schematic diagram of culling a node according to a first embodiment of the present invention;

FIG. 4 is a flowchart of a data comparison method according to a second embodiment of the present invention;

FIG. 5 is a schematic diagram of a server according to a third embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments of the present invention in order to give the reader a better understanding of the present application. The technical solution claimed in the present application can nevertheless be implemented without these technical details, and with various changes and modifications based on the following embodiments.

A first embodiment of the present invention relates to a data comparison method, including: reading two pieces of data in parallel, wherein each piece of data comprises a plurality of records and each record forms a hash node; attempting to insert each hash node into the hash tree; and, during the attempted insertion, if it is determined that a node matching the hash node exists in the hash tree, removing that node.

The implementation details of the data comparison method of the present embodiment are specifically described below, and the following description is only provided for the convenience of understanding, and is not necessary for implementing the present embodiment.

The data comparison method in this embodiment may be applied to a server, and the server is configured to perform a consistency comparison on two pieces of data to identify the difference data between them. For example, a subsidiary company synchronizes a piece of data A to the office data system; in order to determine whether the data B received by the office data system is completely consistent with the data A sent by the subsidiary company, data A at the subsidiary company is compared with data B received by the office data system. The server therefore obtains data A directly from the subsidiary company and data B directly from the office data system, then compares the two pieces of data A and B and obtains the difference data between them.

Fig. 1 is a flowchart of a data comparison method according to a first embodiment, which specifically includes the following steps:

Step 101, reading two pieces of data in parallel.

The server reads the two pieces of data A and B in parallel and attempts to insert their records into the hash tree. Specifically, each piece of data includes a plurality of records, and each record forms a hash node. Taking data A as an example, the server reads each record in data A in sequence and attempts to insert each record (i.e., a hash node) into the hash tree. As each record is read, it forms a hash node, and each hash node comprises a keyword key, a record content word, and a data source flag. The flag indicates which piece of data the record comes from; for example, if record 1 is from data A, the flag of record 1 is A. The word is the specific content of the record. The key uniquely characterizes the record; for example, if the word of the record is partner office data, the key may be the partner code. In this embodiment, the server attempts to insert the records one at a time, processing the next record only after the current record has been processed.
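The structure of a hash node and the reading of two pieces of data can be sketched as follows. This is only an illustration under stated assumptions: the class name `HashNode` and the function `read_in_parallel` are hypothetical, each record is assumed to be a `(key, word)` pair, and "reading in parallel" is approximated by alternating between the two sources so that neither has to be loaded completely before the other is touched.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class HashNode:
    key: str    # keyword that uniquely characterizes the record
    word: str   # record content
    flag: str   # data source, "A" or "B"


def read_in_parallel(data_a, data_b):
    """Yield hash nodes alternately from data A and data B (assumed interleaving)."""
    for rec_a, rec_b in zip_longest(data_a, data_b):
        if rec_a is not None:
            yield HashNode(rec_a[0], rec_a[1], "A")
        if rec_b is not None:
            yield HashNode(rec_b[0], rec_b[1], "B")


data_a = [("P001", "partner one"), ("P002", "partner two")]
data_b = [("P001", "partner one")]
for node in read_in_parallel(data_a, data_b):
    print(node.flag, node.key)
```

Each yielded node is processed to completion before the next one is produced, which mirrors the one-record-at-a-time handling described above.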

In this embodiment, a hash tree is constructed according to the prime number resolution theorem, so as to implement the data comparison method described in this embodiment. The designer can select a plurality of prime numbers to construct the hash tree according to the required storage capacity. In this embodiment, the prime number sequence is PRIM = {3, 5, 7, 11, 13, 17, 19, 23, 29}. That is, there are 3 nodes under the root node of the hash tree, which form the first layer; there are 5 nodes under each node of the first layer, which form the second layer; there are 7 nodes under each node of the second layer, which form the third layer; and so on. According to the prime number resolution theorem, these 9 prime numbers can accommodate 3 × 5 × 7 × 11 × 13 × 17 × 19 × 23 × 29 = 3,234,846,615 pieces of data, that is, more than 3 billion; constructing the hash tree from these 9 prime numbers therefore meets both the storage capacity requirement and the complexity requirement. It should be noted that, in this embodiment, the type of the selected hash tree is not limited, and any hash tree type suitable for the data comparison method of the present application belongs to the protection scope of the present application.

The following steps 102 to 108 are the process of attempting to insert records (hash nodes) into the prime number based hash tree.
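The capacity figure can be checked with a short calculation. The following is only an illustrative sketch; the names `capacity` and `positions_per_layer` are not from the original text.

```python
from itertools import accumulate
from math import prod
from operator import mul

PRIM = [3, 5, 7, 11, 13, 17, 19, 23, 29]

# Product of all nine primes: the number of distinct remainder combinations.
capacity = prod(PRIM)

# Number of node positions in each layer: 3, then 3*5, then 3*5*7, and so on.
positions_per_layer = list(accumulate(PRIM, mul))

print(capacity)                 # 3234846615
print(positions_per_layer[:3])  # [3, 15, 105]
```

The last entry of `positions_per_layer` equals `capacity`, which is why nine layers suffice for roughly 3.2 billion distinct positions.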

Step 102, performing a hash operation on the keyword of the hash node to obtain a random number.

Taking one record in data A as an example, the record forms a hash node, and a hash operation is performed on the key of the hash node to obtain a random number, so that the positions at which hash nodes are inserted into the hash tree are randomly, or approximately randomly, distributed. After the random number is calculated, the position search step, that is, step 103, is performed.

Step 103, determining the node position corresponding to the hash node according to the random number and the prime number corresponding to the current search layer in the hash tree.

Specifically, the random number obtained in step 102 is divided by the prime number corresponding to the current search layer, and the remainder i is obtained; the node position corresponding to the hash node is then the i-th node position of the current search layer.

The search for the node position corresponding to the hash node starts from the root node of the hash tree. The search therefore begins at the first layer, that is, the initial value of the current search layer is the first layer, and the prime number corresponding to the first layer is PRIM[0] = 3; in this first search, the node position corresponding to the hash node is the i-th child node position of the first layer.
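Steps 102 and 103 can be illustrated with the following minimal sketch. The original text does not specify the hash operation, so CRC32 from Python's `zlib` module is used here purely as an assumed stand-in; the function names `random_number`, `node_position`, and `search_path` are likewise hypothetical.

```python
import zlib

PRIM = [3, 5, 7, 11, 13, 17, 19, 23, 29]


def random_number(key: str) -> int:
    """Step 102: hash the keyword (CRC32 is an assumption, not the patent's choice)."""
    return zlib.crc32(key.encode("utf-8"))


def node_position(number: int, layer: int) -> int:
    """Step 103: remainder of the random number modulo the prime of the layer."""
    return number % PRIM[layer]


def search_path(key: str) -> list:
    """Positions that would be visited, layer by layer, starting at the first layer."""
    number = random_number(key)
    return [node_position(number, layer) for layer in range(len(PRIM))]


print(node_position(10, 0))  # 1, since 10 % 3 == 1
print(node_position(10, 1))  # 0, since 10 % 5 == 0
```

Layers are indexed from 0 in the sketch, so `layer=0` corresponds to the first layer and to PRIM[0] = 3.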

Step 104, judging whether the node position is occupied or not; if not, go to step 105; if it is occupied, step 106 is entered.

If no other hash node has been inserted at the node position corresponding to the hash node, the node position is unoccupied, and the hash node can be inserted into it, that is, step 105 is performed; if another hash node has already been inserted at that node position, the node position is occupied.

Step 105, inserting the hash node into the node position.

Step 106, determining whether the node at the node position matches the hash node; if yes, go to step 107; if not, go to step 108, and then return to step 103.

When the node position corresponding to the hash node is occupied, the server retrieves the node at that position. The node at the node position is itself a hash node and comprises a key, a word, and a flag; in this embodiment, the record to be inserted is called the hash node and the record already inserted at the node position is called the node, only to distinguish the two by name. The server judges whether the hash node matches the node. In this embodiment, the node at the node position matches the hash node to be inserted when the two have the same keyword and different data sources. That is, when the key of the node is the same as the key of the hash node and the flag of the node is different from the flag of the hash node, the node and the hash node can be regarded as the same record from the two pieces of data, and the node is to be removed, that is, step 107 is performed. When the key of the node is different from the key of the hash node, or the flag of the node is the same as the flag of the hash node, the node does not match the hash node, and the search for the node position corresponding to the hash node continues in the next layer of the hash tree.

It should be noted that, in this embodiment, the specific condition that a node in the node location matches a hash node to be inserted is merely an example, and is not limited thereto; those skilled in the art can set the matching conditions according to the actual data comparison requirement.

Step 107, removing the node.

When it is determined that the node and the hash node match, the node is removed from the hash tree. In this embodiment, the node is removed as follows: if the node is a child node, a leaf node of the node is selected to replace the node and the selected leaf node is deleted; if the node is a leaf node, the node is deleted.

Step 108, updating the current search layer according to a preset rule.

During the search, when the node at the node position does not match the hash node, the search for the node position corresponding to the hash node continues in the next layer. The preset rule in this embodiment is to update the current search layer to the layer below it: if the current search layer is the first layer, the updated current search layer is the second layer; if the current search layer is the second layer, the updated current search layer is the third layer; and so on. After step 108, the process returns to step 103.

Fig. 2 is a schematic diagram of a hash node being inserted into its corresponding node position when that position is unoccupied. Fig. 2 shows the process of attempting to insert record 1 and record 2 into the hash tree. Record 1 has key 10 and flag A, indicating that record 1 is from data A and its key is 10; record 2 has key 1 and flag B, indicating that record 2 is from data B and its key is 1. Note that the record contents are not shown in fig. 2 because of their large data volume.

As indicated by the dotted line labeled record 1 in fig. 2, the node position P1 corresponding to record 1 in the first layer of the hash tree is already occupied by a node (key 0, flag A), and record 1 does not match that node, so the search continues at node position P12 in the second layer. Node position P12 is unoccupied (unoccupied positions are labeled null in the figure), so record 1 is inserted there. Similarly, as indicated by the dotted line labeled record 2 in fig. 2, the node position P3 corresponding to record 2 in the first layer is already occupied by a node (key 1, flag B), and record 2 does not match that node, so the search continues at node position P35 in the second layer; P35 is unoccupied, so record 2 is inserted there. In the example of fig. 2, the node (key 1, flag B) at node position P3 has the same key and the same flag as record 2, which indicates that data B contains a duplicate record.

Fig. 3 is a schematic diagram of a node being removed when the node position corresponding to a hash node is occupied and the node at that position matches the hash node. Fig. 3 shows the process of attempting to insert record 3 into the hash tree. Record 3 has key 0 and flag B, indicating that record 3 is from data B and its key is 0. As indicated by the dotted line labeled record 3 in fig. 3, the node position P1 corresponding to record 3 in the first layer of the hash tree is already occupied by a node (key 0, flag A). Record 3 has the same key as that node and a different flag, so record 3 matches the node; the two are the same record in the two pieces of data, and the node (key 0, flag A) is to be removed. At this time, a leaf node, node1, of the node (key 0, flag A) is obtained, the node is replaced with node1, and node1 is deleted from its original position. It should be noted that, in this embodiment, the manner of removing a node is not limited, and in other examples the nodes in the lower layers of the node may be moved up one after another; for example, in fig. 3, the node may be replaced with node2, node2 may be replaced with node1, and node1 may be deleted.
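The alternative mentioned at the end of the preceding paragraph, in which lower-layer nodes are moved up one after another, can be sketched as follows. This is a structural illustration only: the `Node` class, the function `cull_shift_up`, the keys `"node1"` and `"node2"`, and the child positions 0 and 2 are all hypothetical, and the sketch does not model how positions are derived from hash values.

```python
class Node:
    def __init__(self, key, flag):
        self.key, self.flag = key, flag
        self.children = {}  # child position -> Node; a missing position is unoccupied


def cull_shift_up(slots, i):
    """Remove slots[i] by moving one chain of descendants up by one layer each."""
    node = slots[i]
    while node.children:
        j = next(iter(node.children))                # pick any occupied child position
        child = node.children[j]
        node.key, node.flag = child.key, child.flag  # the child's record moves up
        slots, i, node = node.children, j, child     # continue one layer lower
    del slots[i]  # the last node of the chain is a leaf and is deleted


# Chain from fig. 3: node (key 0, flag A) -> node2 -> node1.
root = {1: Node("0", "A")}
root[1].children[0] = Node("node2", "A")
root[1].children[0].children[2] = Node("node1", "A")

cull_shift_up(root, 1)
print(root[1].key)              # node2
print(root[1].children[0].key)  # node1
```

After the call, the position that held the removed node holds the record of node2, the position of node2 holds the record of node1, and the original position of node1 is unoccupied. A node that is already a leaf is simply deleted by the same function.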

In this embodiment, in the process of inserting each record into the hash tree, records that come from different pieces of data and have the same keyword are removed. After all records have been processed, the nodes remaining in the hash tree are therefore the difference data between the two pieces of data, where difference data means records that exist in only one piece of data (only in data A, or only in data B). Traversing the hash tree yields the difference data between the two pieces of data.

Compared with the prior art, the embodiment of the invention reads two pieces of data in parallel, and removes a node when it is determined that the node matching the hash node exists in the hash tree. Because the two pieces of data are read in parallel, the capacity bottleneck that arises in the prior art from having to hold one complete piece of data, and that is particularly serious when the data volume is large, can be avoided. In addition, because the nodes matching the hash nodes are removed, the records that are not needed in the data comparison are discarded and only the needed records are retained, which further saves storage space. Therefore, the embodiment of the invention can efficiently compare large volumes of data and effectively relieves the storage-capacity bottleneck of the Hash table in existing comparison schemes.

A second embodiment of the present invention relates to a data comparison method. The second embodiment is substantially the same as the first embodiment, and mainly differs in that the second embodiment can also identify nodes having the same keyword but different record contents.

Fig. 4 is a flowchart of a data comparison method according to a second embodiment of the present invention, which includes the following steps:

Step 201, reading two pieces of data in parallel; each piece of data comprises a plurality of records, and each record forms a hash node;

Step 202, performing a hash operation on the keyword of the hash node to obtain a random number, and entering a position searching step;

Step 203, determining the node position corresponding to the hash node according to the random number and the prime number corresponding to the current search layer in the hash tree;

Step 204, judging whether the node position is occupied; if not, go to step 205; if so, go to step 206;

Step 205, inserting the hash node into the node position;

Step 206, determining whether the node at the node position matches the hash node; if yes, go to step 207; if not, go to step 210 and then return to step 203;

When the node at the node position is determined to match the hash node, it is determined that a node matching the hash node exists in the hash tree.

Step 207, judging whether the record contents of the node and the hash node are the same; if not, go to step 208; if yes, go to step 209;

Step 208, outputting and storing the node and the hash node, and proceeding to step 209;

Step 209, removing the node;

Step 210, updating the current search layer according to a preset rule.

Steps 201 to 206 and steps 209 to 210 are substantially the same as steps 101 to 108 in the first embodiment and are not described again here; the difference is that steps 207 and 208 are added.

When step 206 determines that the node at the node position matches the hash node, it is judged whether the record contents of the node and the hash node are the same. If the record contents are the same, the node is completely consistent with the hash node, and the process proceeds directly to step 209. If the record contents are different, the node is not completely consistent with the hash node, and the process proceeds to step 208, in which the node and the hash node are output and stored, and then to step 209. The node and the hash node may be stored in a preset difference data table, which is dedicated to storing records with different flags, the same key, and different words. In this embodiment, the words of records with the same key may differ because of various uncertain factors in network transmission, for example, part of the record content being lost or changed while data is synchronized from the subsidiary company to the office data system.
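The additional check of steps 207 and 208 can be sketched as follows. This is an illustration only: the function name `handle_match` is hypothetical, a plain list stands in for the preset difference data table, and the removal of the node in step 209 is not shown.

```python
from collections import namedtuple

HashNode = namedtuple("HashNode", "key word flag")


def handle_match(node, hash_node, diff_table):
    """Steps 207-208 for a node that already matched the hash node in step 206."""
    if node.word != hash_node.word:           # step 207: record contents differ
        diff_table.append((node, hash_node))  # step 208: output and store both
    # Step 209 (removing the node) follows in either case and is not shown here.


diff_table = []
handle_match(HashNode("P001", "name=Acme", "A"), HashNode("P001", "name=Acme", "B"), diff_table)
handle_match(HashNode("P002", "name=Bolt", "A"), HashNode("P002", "name=B0lt", "B"), diff_table)
print(len(diff_table))  # 1
```

Only the second pair is stored, because its two records share a key and come from different sources but differ in content.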

Compared with the first embodiment, the present embodiment can also identify records with different data sources, the same keyword, and different record contents, that is, it can identify more kinds of difference data.

The steps of the above methods are divided as they are for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, such variations fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes, without changing the core design of the algorithm or process also falls within the protection scope of this patent.

A third embodiment of the present invention relates to a server, as shown in fig. 5, including:

at least one processor 501; and

a memory 502 communicatively coupled to the at least one processor 501; wherein

the memory 502 stores instructions executable by the at least one processor 501 to enable the at least one processor 501 to perform the above-described method embodiments.

The server may further include: an input device and an output device.

The memory 502 and the processor 501 are coupled by a bus, which may include any number of interconnected buses and bridges that couple one or more of the various circuits of the processor 501 and the memory 502 together. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 501 is transmitted over a wireless medium through an antenna, which further receives the data and transmits the data to the processor 501.

The processor 501 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 502 may be used to store data used by the processor 501 in performing operations.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to method embodiments provided in the embodiments of the present application.

A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.

That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that in practical applications various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A data comparison method, comprising:

reading two pieces of data in parallel; wherein each piece of data comprises a plurality of records and each of the records forms a hash node;

attempting to insert each of the hash nodes into a hash tree; in the process of attempting to insert, if it is determined that a node matching the hash node exists in the hash tree, removing the node;

wherein the node matching the hash node means that: the node and the hash node have the same keyword and different data sources;

after determining that the node matching the hash node exists in the hash tree, the method further includes:

judging whether the record contents of the node and the hash node are the same; if the record contents are different, outputting and storing the node and the hash node, and entering the step of removing the node; if the record contents are the same, directly entering the step of removing the node;

the removing the node comprises: if the node is a child node, selecting a leaf node of the node to replace the node and deleting the selected leaf node; and if the node is a leaf node, deleting the node.

2. The data comparison method of claim 1, wherein the hash tree is a prime number based hash tree.

3. The data comparison method of claim 2, wherein the process of attempting to insert comprises:

performing a hash operation on the keyword of the hash node to obtain a random number, and entering a position searching step;

the position searching step comprises: determining the node position corresponding to the hash node according to the random number and the prime number corresponding to the current search layer in the hash tree;

inserting the hash node into the node position when the node position is unoccupied;

when the node position is occupied, if the node at the node position matches the hash node, determining that the node matching the hash node exists in the hash tree; and if the node at the node position does not match the hash node, updating the current search layer according to a preset rule, and entering the position searching step.

4. The data comparison method of claim 2, wherein the prime number sequence used to construct the hash tree is 3, 5, 7, 11, 13, 17, 19, 23, 29.

5. A server, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data comparison method of any one of claims 1 to 4.

6. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data comparison method according to any one of claims 1 to 4.