WO2018121025A1

WO2018121025A1 - Method and system for comparing data of data table

Info

Publication number: WO2018121025A1
Application number: PCT/CN2017/108196
Authority: WO
Inventors: 崔鑫; 杨磊; 蔺若林
Original assignee: 华为技术有限公司
Priority date: 2016-12-30
Filing date: 2017-10-28
Publication date: 2018-07-05
Also published as: CN107070645A; CN107070645B

Abstract

Provided in the present application are a method and a system for comparing data of a data table. The system comprises a client and a plurality of servers. A first database corresponds to at least one first server, and a second database corresponds to at least one second server. The client acquires first metadata and second metadata of a target data table in the two databases, the first metadata comprising a first range corresponding to the data of the target data table, the second metadata comprising a second range corresponding to the data of the target data table. The client determines a target range according to at least one of the first range and the second range. The first server signs, according to the target range, the data of the target data table in the first database to obtain a first signature; similarly, the second server obtains a second signature. The client determines, according to the first signature and the second signature, whether the data of the target data table in the two databases is identical, avoiding excessive data transmission and comparison. The invention has the advantages of high operation speed, low cost and low network resources occupancy.

Description

Method and system for comparing data of data tables

Technical field

The present application relates to the field of databases and, more particularly, to a method and system for comparing data of a data table.

Background technique

For big data, the key-value database is the preferred choice for scenarios with large numbers of random writes and random reads. All data in a key-value database exists in key-value form. The key-value form has a strictly defined structure, and all data in the database is stored in the underlying file system as immutable (non-modifiable) files. When new data is written, a new key-value is generated; when old data is rewritten or deleted, a new key-value is also generated to mark the rewrite or deletion.

In addition, to achieve higher data availability and better disaster tolerance, big data deployments usually back up data off-site as part of a multi-data-center solution. Verifying the consistency of data before, during, and after backup has therefore become an important feature in the field of big data storage.

Existing comparison tools are data-based. When such a tool is used to compare the data of two databases (a working database and a backup database, whose data tables should have the same structure), the tool parallelizes the verification task. For example, it submits a MapReduce (MR) job that is distributed to many nodes for parallel execution. The comparison tool reads data from the data tables of the two databases and compares the data to find inconsistencies.

The existing comparison tool compares the data in the data table row by row, so comparison efficiency is low and the tool runs slowly. In addition, the existing comparison technology requires the MapReduce framework to communicate with multiple servers of the local database's cluster, and possibly also with servers of the remote database's cluster, which consumes a large amount of network resources.

Summary of the invention

The present application provides a method and system for comparing data of a data table, which can avoid transmitting and comparing large amounts of data, runs fast at low cost, and occupies few network resources.

A first aspect of the present application provides a method of comparing data of a data table. The method is applied to a system for comparing data of a target data table in a first database and a second database, the system comprising a client and a plurality of servers, wherein the first database corresponds to at least one first server and the second database corresponds to at least one second server. The method comprises: the client acquiring first metadata of the target data table in the first database and second metadata of the target data table in the second database, wherein the first metadata includes a first range corresponding to the data of the target data table in a server of the first database, and the second metadata includes a second range corresponding to the data of the target data table in a server of the second database; the client determining a target range according to at least one of the first range and the second range; the at least one first server signing, according to the target range, the data of the target data table in the first database to obtain a first signature; the at least one second server signing, according to the target range, the data of the target data table in the second database to obtain a second signature; and the client determining, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.

In the method of the first aspect, the client determines the target range according to the distribution of the data of the data table, the servers sign the data according to the target range, and the client checks whether the signatures corresponding to the data of the data table in the two databases are consistent. This makes it possible to judge whether the data of the two data tables is consistent while avoiding the transmission and comparison of large amounts of data, so the method runs fast, costs little, and occupies few network resources.

In a possible implementation manner of the first aspect, each server of the first database corresponds to one first server, and the first range includes the sub-range of the data of the target data table on each server of the first database; each server of the second database corresponds to one second server, and the second range includes the sub-range of the data of the target data table on each server of the second database. The client determining the target range according to at least one of the first range and the second range includes: the client determining sub-ranges of the target range according to the sub-range of the data of the target data table on each server of the first database and the sub-range of the data of the target data table on each server of the second database, wherein the data corresponding to each sub-range of the target range is distributed on one server in the first database and on one server in the second database. In this implementation manner, when the data is subsequently signed, data transmission across servers (cross-RS) is no longer required, which can further improve the running speed and reduce the occupation of network resources.

In a possible implementation manner of the first aspect, before the at least one first server signs the data of the target data table in the first database according to the target range to obtain the first signature, and before the at least one second server signs the data of the target data table in the second database according to the target range to obtain the second signature, the method further includes: at least one of the client, the at least one first server, and the at least one second server performing tree segmentation on each of the sub-ranges. The at least one first server signing the data of the target data table in the first database according to the target range to obtain the first signature includes: the at least one first server signing, according to the tree segmentation, the segments of the data of the target data table in the first database to obtain a tree-type first signature. The at least one second server signing the data of the target data table in the second database according to the target range to obtain the second signature includes: the at least one second server signing, according to the tree segmentation, the segments of the data of the target data table in the second database to obtain a tree-type second signature. In this implementation manner, tree segmentation of the sub-ranges of the target range yields finer-grained signatures, which can improve efficiency when the signatures are compared.

In a possible implementation manner of the first aspect, the client determining, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database includes: the client determining, according to the tree-type first signature and the tree-type second signature, whether the signatures of the same layer of the first signature tree and the second signature tree are consistent; when the signatures are inconsistent, it is determined that, in the segment corresponding to that layer, the data of the target data table in the first database differs from the data of the target data table in the second database.
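The layer-by-layer comparison can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a tree-type signature is a binary hash tree built over the same non-empty list of segments on both sides, with SHA-256 as the hash algorithm. The names `sign_segment`, `build_tree`, and `diff_segments` are hypothetical and do not come from the application.

```python
import hashlib

def sign_segment(rows):
    """Hash the rows (bytes) of one segment; length prefixes keep row boundaries unambiguous."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(len(row).to_bytes(4, "big"))
        digest.update(row)
    return digest.digest()

def build_tree(segments):
    """Return the layers of the tree: layers[0] holds leaf signatures, layers[-1] holds the root."""
    layer = [sign_segment(rows) for rows in segments]
    layers = [layer]
    while len(layer) > 1:
        # Each parent signs the concatenation of its (one or two) children.
        layer = [hashlib.sha256(b"".join(layer[i:i + 2])).digest()
                 for i in range(0, len(layer), 2)]
        layers.append(layer)
    return layers

def diff_segments(tree_1, tree_2):
    """Walk both trees from the root down and return indices of differing leaf segments."""
    top = len(tree_1) - 1
    suspects = [0] if tree_1[top][0] != tree_2[top][0] else []
    for level in range(top - 1, -1, -1):
        next_suspects = []
        for parent in suspects:
            for child in (2 * parent, 2 * parent + 1):
                if child < len(tree_1[level]) and tree_1[level][child] != tree_2[level][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects

first = [[b"r1", b"r2"], [b"r3"], [b"r4"], [b"r5", b"r6"]]
second = [[b"r1", b"r2"], [b"r3"], [b"r4-changed"], [b"r5", b"r6"]]
print(diff_segments(build_tree(first), build_tree(second)))
```

Only subtrees whose signatures differ are descended into, so matching parts of the table are ruled out by comparing a single signature per subtree.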

In a possible implementation manner of the first aspect, at least one of the client, the at least one first server, and the at least one second server performing tree segmentation on each of the sub-ranges includes: the at least one first server and the at least one second server collecting statistics on the density of the data in the target range; and the at least one first server and the at least one second server performing tree segmentation on each of the sub-ranges according to the statistical result. This implementation manner can balance the load across the servers.
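This passage does not specify how the density statistics are turned into segments. As one plausible reading only, the following Python sketch cuts a sub-range into segments holding a bounded number of rows, so that dense key regions get narrower segments than sparse ones. The function name `split_by_density` and the `max_rows` parameter are hypothetical.

```python
def split_by_density(sorted_keys, max_rows):
    """Cut a sub-range into contiguous segments of at most max_rows rows each.

    sorted_keys: the row keys present in the sub-range, in ascending order.
    Returns a list of (first_key, last_key, row_count) tuples.
    """
    segments = []
    for i in range(0, len(sorted_keys), max_rows):
        chunk = sorted_keys[i:i + max_rows]
        segments.append((chunk[0], chunk[-1], len(chunk)))
    return segments

# Keys are dense near the start of the range and sparse near the end.
keys = [1, 2, 3, 4, 5, 50, 90, 91]
print(split_by_density(keys, 3))
```

Because each segment carries a similar number of rows, the signing work per segment is roughly even regardless of how the keys are spread over the range.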

In a possible implementation manner of the first aspect, the at least one first server signing, according to the target range, the data of the target data table in the first database to obtain a first signature includes: the at least one first server signing, according to the target range, the data of the target data table in the first database by using a hash algorithm to obtain the first signature; and the at least one second server signing, according to the target range, the data of the target data table in the second database to obtain a second signature includes: the at least one second server signing, according to the target range, the data of the target data table in the second database by using a hash algorithm to obtain the second signature.

A second aspect of the present application provides a system for comparing data of a data table, wherein the system is configured to compare data of a target data table in a first database and a second database. The system includes a computing device running a client and a plurality of server machines running servers, wherein the first database comprises at least one first server machine running a first server, and the second database comprises at least one second server machine running a second server. The computing device is configured to acquire first metadata of the target data table in the first database and second metadata of the target data table in the second database, where the first metadata includes a first range corresponding to the data of the target data table in a server of the first database, and the second metadata includes a second range corresponding to the data of the target data table in a server of the second database. The computing device is further configured to determine a target range according to at least one of the first range and the second range. The at least one first server is configured to sign, according to the target range, the data of the target data table in the first database to obtain a first signature. The at least one second server is configured to sign, according to the target range, the data of the target data table in the second database to obtain a second signature. The computing device is further configured to determine, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.

In a possible implementation manner of the second aspect, each server machine in the first database that stores the target data table is a first server machine running the first server, and the first range includes the sub-range of the data of the target data table on each first server machine of the first database; each server machine in the second database that stores the target data table is a second server machine running the second server, and the second range includes the sub-range of the data of the target data table on each second server machine of the second database. The computing device is specifically configured to: determine sub-ranges of the target range according to the sub-range of the data of the target data table on each first server machine of the first database and the sub-range of the data of the target data table on each second server machine of the second database, wherein the data corresponding to each sub-range of the target range is distributed on one server machine in the first database and on one server machine in the second database.

In a possible implementation manner of the second aspect, before the first server signs, according to the target range, the data of the target data table in the first database to obtain the first signature, and before the second server signs, according to the target range, the data of the target data table in the second database to obtain the second signature, at least one of the computing device, the at least one first server, and the at least one second server is configured to perform tree segmentation on each of the sub-ranges; the at least one first server is specifically configured to sign, according to the tree segmentation, the segments of the data of the target data table in the first database to obtain a tree-type first signature; and the at least one second server is specifically configured to sign, according to the tree segmentation, the segments of the data of the target data table in the second database to obtain a tree-type second signature.

In a possible implementation manner of the second aspect, the computing device is specifically configured to: determine, according to the tree-type first signature and the tree-type second signature, whether the signatures of the same layer of the first signature tree and the second signature tree are consistent; when the signatures are inconsistent, it is determined that, in the segment corresponding to that layer, the data of the target data table in the first database differs from the data of the target data table in the second database.

In a possible implementation manner of the second aspect, the at least one first server and the at least one second server are configured to collect statistics on the density of the data in the target range; and the at least one first server and the at least one second server are configured to perform tree segmentation on each of the sub-ranges according to the statistical result.

In a possible implementation manner of the second aspect, the at least one first server is specifically configured to: sign, according to the target range, the data of the target data table in the first database by using a hash algorithm to obtain the first signature; and the at least one second server is specifically configured to: sign, according to the target range, the data of the target data table in the second database by using a hash algorithm to obtain the second signature.

A third aspect of the present application provides a storage medium in which a program is stored. When the program is run by a computing device and a server, the computing device and the server perform the method of comparing data of a data table provided by the foregoing first aspect or any implementation of the first aspect. The storage medium includes, but is not limited to, a read-only memory, a random access memory, a flash memory, an HDD, or an SSD.

A fourth aspect of the present application provides a computer program product comprising program instructions. When the computer program product is executed by a computing device and a server, the method of comparing data of a data table provided by the foregoing first aspect or any implementation of the first aspect is performed. The computer program product can be a software installation package; when the method of comparing data of a data table provided by the foregoing first aspect or any implementation of the first aspect needs to be used, the computer program product can be downloaded to, and executed on, the computing device and the server.

Description of the drawings

Figure 1 is a schematic diagram of a method of comparing data of a data table using a comparison tool.

Figure 2 is a schematic block diagram of a system for comparing data of a data table in accordance with one embodiment of the present invention.

Figure 3 is a schematic block diagram of a system for comparing data of a data table in accordance with another embodiment of the present invention.

Figure 4 is a schematic flow chart of a method of comparing data of a data table according to an embodiment of the present invention.

Figure 5 is a schematic illustration of the segmentation target range of one embodiment of the present invention.

Figure 6 is a schematic illustration of a segmentation target range in accordance with another embodiment of the present invention.

Figure 7 is a schematic illustration of a segmentation target range in accordance with another embodiment of the present invention.

Figure 8 is a schematic illustration of a segmentation target range in accordance with another embodiment of the present invention.

Figure 9 is a schematic illustration of a segmentation target range in accordance with another embodiment of the present invention.

Figure 10 is a schematic illustration of the results of the segmentation of the target range of one embodiment of the present invention.

Figure 11 is a schematic diagram of a tree-type signature in accordance with an embodiment of the present invention.

Figure 12 is a schematic block diagram of a computing device or server in accordance with one embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention will be described below with reference to the accompanying drawings.

For verifying the consistency of data in two databases, existing comparison tools are data-based. When such a tool is used to compare the contents of the data tables of the two databases, it parallelizes the verification tasks.

The following describes the process of comparing the data of data tables in databases, taking the Hadoop database (HBase) and an existing comparison tool as an example. FIG. 1 is a schematic diagram of a method 100 in which an existing comparison tool compares data of a data table. The method 100 includes:

S110. The existing comparison tool submits an MR job to the HBase cluster corresponding to the database of data center (DC) 1.

S120. The resource manager (RM) of the HBase cluster distributes the MR job to many nodes for parallel execution, that is, it splits the MR job into multiple map tasks.

S130. Each map task is responsible for comparing a part of the data. Each map task reads data from the HBase clusters of the two data centers DC1 and DC2, then compares the data and outputs the inconsistent data. Typically, each server in the HBase cluster is configured with a region server (RS), which is used to manage tasks running on the server.

The existing comparison tool compares the data in the data table row by row, so comparison efficiency is low and the tool runs slowly. Secondly, the existing comparison tool requires not only the participation of the two HBase clusters but also that the cluster provide nodes on which the MR job runs, so the tool has high resource occupation and running costs. In addition, the existing comparison technology requires the MapReduce framework to communicate with the RSs of multiple servers of the local database's HBase cluster, and possibly also with the RSs of servers of the remote database's HBase cluster, which occupies a large amount of network resources.

Based on the above problem, an embodiment of the present invention provides a method for comparing data of a data table. FIG. 2 shows a schematic block diagram of a system 200 for comparing data of a data table in accordance with an embodiment of the present invention. It should be understood that the system 200 illustrated in FIG. 2 is a schematic block diagram from a software perspective. As shown in FIG. 2, from a software perspective the system 200 includes a client 210 and a plurality of servers, wherein each database corresponds to at least one server: the first database corresponds to at least one first server 221, and the second database corresponds to at least one second server 222.

FIG. 3 shows a schematic block diagram of a system 300 for comparing data of a data table in accordance with an embodiment of the present invention. It should be understood that the system 300 illustrated in FIG. 3 is a schematic block diagram from a hardware perspective. Corresponding to the software view of FIG. 2, system 300 includes a computing device 310 running the client and a plurality of server machines running the servers. The client 210 can be deployed on the user's computing device 310. The computing device 310 is usually not a server machine of any database, that is, usually not a server machine of a DC. The first server 221 can be deployed on a first server machine 321 of the first DC corresponding to the first database, and the second server 222 can be deployed on a second server machine 322 of the second DC corresponding to the second database. Optionally, a first server 221 may be deployed on each server machine of the first database that stores the data table, in which case the machine on which the first server 221 is deployed is regarded as a first server machine 321; likewise, a second server 222 may be deployed on each server machine of the second database that stores the data table, in which case the machine on which the second server 222 is deployed is regarded as a second server machine 322. Of course, a plurality of server machines in each database may share one server, which is not limited in this embodiment of the present invention. The numbers of first servers and second servers shown in FIG. 2, and the numbers of first server machines and second server machines shown in FIG. 3, are only schematic and are not intended to limit the embodiments of the present invention.

In addition, the embodiment of the present invention acquires metadata. Metadata is generally stored in a meta table, and the meta table is usually stored on a server machine of the database other than the ones that store the data table. FIG. 3 schematically shows the meta table of the first database stored on a third server machine 323 of the first database and the meta table of the second database stored on a fourth server machine 324 of the second database. Of course, the meta table can also be stored on a server machine that stores the data table in the database; the embodiment of the present invention does not limit this.

It should be understood that each computing device and server machine in system 300 can be regarded as a node. A server machine that stores the data table (for example, a first server machine or a second server machine) can be regarded as a storage node on which the server is deployed; the server may be part of the functionality of the RS or may exist independently of the RS. A server machine that stores the meta table can be regarded as a metadata management node.

It should be understood that the server of the embodiment of the present invention may be used as a function module of the RS, or may be a separate module or unit, which is not limited by the embodiment of the present invention.

FIG. 4 is a schematic flow diagram of a method 400 of comparing data of a data table in accordance with an embodiment of the present invention. As shown in FIG. 4, method 400 includes:

S410. The client 210 acquires first metadata of the target data table in the first database and second metadata of the target data table in the second database, where the first metadata includes a first range corresponding to the data of the target data table in the servers of the first database, and the second metadata includes a second range corresponding to the data of the target data table in the servers of the second database.

S420. The client 210 determines a target range according to at least one of the first range and the second range.

S430, the at least one first server 221 signs the data of the target data table in the first database according to the target range to obtain a first signature;

S440. The at least one second server 222 signs the data of the target data table in the second database according to the target range to obtain a second signature.

S450. The client 210 determines, according to the first signature and the second signature, whether data of the target data table in the first database is the same as data of the target data table in the second database.

In the method of the embodiment of the present invention, the client determines the target range according to the data distribution of the data table, the servers sign the data according to the target range, and the client checks whether the signatures corresponding to the data of the data tables in the two databases are consistent in order to judge whether the data of the two data tables is consistent. This avoids transmitting and comparing large amounts of data, so the method runs fast, costs little, and occupies few network resources.
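The flow of S430 to S450 can be sketched in Python under simplifying assumptions: each database is represented by a single in-memory object, row keys are integers, the target range is given as a list of inclusive sub-ranges, and SHA-256 stands in for the unspecified signing algorithm. The class `TableServer` and the function `compare_tables` are hypothetical names used only for this illustration.

```python
import hashlib

class TableServer:
    """Stand-in for the server side of one database (illustrative only)."""

    def __init__(self, rows):
        self.rows = dict(rows)  # row key -> value

    def sign(self, start, end):
        """Sign all rows whose key lies in [start, end], visited in key order."""
        digest = hashlib.sha256()
        for key in sorted(k for k in self.rows if start <= k <= end):
            # repr of a (key, value) tuple gives an unambiguous encoding for this sketch.
            digest.update(repr((key, self.rows[key])).encode())
        return digest.hexdigest()

def compare_tables(first, second, target_ranges):
    """Client side: return the sub-ranges whose two signatures differ."""
    return [r for r in target_ranges if first.sign(*r) != second.sign(*r)]

first_db = TableServer({1: "a", 2: "b", 40: "c"})
second_db = TableServer({1: "a", 2: "b", 40: "X"})
print(compare_tables(first_db, second_db, [(1, 25), (26, 60)]))
```

Only one fixed-size signature per sub-range crosses the network, whatever the number of rows in that sub-range. An empty result means that no signature mismatch was found.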

Specifically, the first database and the second database in which the target data tables to be compared are located are different databases, and the two databases may further belong to server clusters of different data centers. Of course, the two databases may also belong to the same data center, which is not limited by the embodiment of the present invention.

Generally, the data table in the database is large, and it is generally required to divide the data table horizontally and store it on multiple servers to enhance the speed of concurrent processing.

In S410, the client 210 communicates with the servers of the first database and of the second database that store the target data table, to obtain the first metadata of the target data table in the first database and the second metadata of the target data table in the second database. Metadata is generally stored in a meta table. The meta table is usually stored on a server of the database other than the ones that store the data table. Of course, the meta table can also be stored on a server of the database that stores the data table; the embodiment of the present invention does not limit this.

The client 210 obtains the two meta tables corresponding to the target data tables of the two databases, that is, obtains the first metadata and the second metadata. It is assumed that each database includes three servers, one RS runs on each server, and each RS corresponds to one region in which the target data table is stored. According to the first metadata and the second metadata, the range distribution corresponding to each region, that is, its start key and end key, is obtained. The first metadata includes the first range corresponding to the data of the target data table in the servers of the first database, and the second metadata includes the second range corresponding to the data of the target data table in the servers of the second database. In a specific example, the distribution of the target data table table1 can be as shown in Table 1.

Table 1 Distribution of target data tables

The target data table of the first database has a key range of 1-30 on RS1 of the first database, a key range of 31-80 on RS2 of the first database, and a key range of 81-100 on RS3 of the first database. The target data table of the second database has a key range of 1-25 on RS1 of the second database, a key range of 26-60 on RS2 of the second database, and a key range of 61-100 on RS3 of the second database.
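To make the distribution concrete, the following Python sketch encodes the ranges of this example as lists of (start key, end key, RS) entries and looks up which RS holds a given key. It assumes integer keys and inclusive end keys, as in the example; the names `FIRST_META`, `SECOND_META`, and `locate` are hypothetical.

```python
from bisect import bisect_right

# (start key, end key, region server) per region, sorted by start key.
FIRST_META = [(1, 30, "RS1"), (31, 80, "RS2"), (81, 100, "RS3")]
SECOND_META = [(1, 25, "RS1"), (26, 60, "RS2"), (61, 100, "RS3")]

def locate(meta, key):
    """Return the RS whose region contains key, or None if no region does."""
    starts = [start for start, _, _ in meta]
    i = bisect_right(starts, key) - 1  # last region starting at or before key
    if i >= 0 and meta[i][0] <= key <= meta[i][1]:
        return meta[i][2]
    return None

print(locate(FIRST_META, 28), locate(SECOND_META, 28))
```

Key 28 falls on RS1 in the first database but on RS2 in the second, which is why a range such as 1-30 cannot be signed by a single server on both sides without further splitting.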

In S420, the client 210 determines the target range according to at least one of the first range and the second range.

Optionally, as in the distribution of the above example, each server of the first database corresponds to one first server 221, and the first range includes the sub-range of the data of the target data table on each server of the first database; each server of the second database corresponds to one second server 222, and the second range includes the sub-range of the data of the target data table on each server of the second database. The client 210 determining the target range according to at least one of the first range and the second range may include: the client 210 determining sub-ranges of the target range according to the sub-range of the data of the target data table on each server of the first database and the sub-range of the data of the target data table on each server of the second database. The data corresponding to each sub-range of the target range is distributed on one server in the first database and on one server in the second database.

Specifically, the client 210 may segment the ranges on the principle of maximally matching the overlapping ranges, according to the first range and the second range (that is, the distributions of start keys and end keys) corresponding to the two data tables, to obtain the target range. The target range includes a plurality of sub-ranges, and the data corresponding to each sub-range is distributed on one server in the first database and on one server in the second database. In this way, when the data is signed, data transmission between servers (cross-RS) is no longer required, which can further improve the running speed and reduce the occupation of network resources.

A scheme for dividing the target range into sub-ranges is described in detail below. This scheme not only ensures that each sub-range of the target range is distributed on one server in the first database and on one server in the second database, but also ensures that the number of sub-ranges is minimized. The specific steps of the segmentation can be as follows.

Step 1. The client 210 forms two region queues by arranging the regions over which the target data tables of the two databases are distributed on the servers in ascending order of row key. The first range corresponds to region queue A (A1, A2, ...), and the second range corresponds to region queue B (B1, B2, ...). The client 210 takes regions from the two region queues in sequence.

Step 2. The client 210 compares the ranges of the two selected regions (for example, Ax and By) to see whether they overlap. The following situations are distinguished:

a) If the two regions do not overlap, the region with the smaller start key is output as an already-segmented region (that is, a sub-range of the target range); the next region is then taken from the region queue in which the region with the smaller start key was located, and step 2 is repeated to continue the comparison.

b) If the two regions overlap, the following cases are distinguished:

I. Fully overlapping situations:

As shown in FIG. 5, when the two regions (A1 and B1) completely overlap, either one of them is output as the already-segmented region C1 (that is, a sub-range of the target range). The next region is then taken from each of the two region queues, and step 2 is repeated to continue the comparison.

II. Partial overlap (same start key, different end key):

As shown in FIG. 6, when the two regions (A1 and B1) partially overlap, the overlapping portion is cut out and output as the already-divided region C1 (i.e., a sub-range of the target range). B1 is truncated, and its remaining part B1- is treated as a new region and compared with the next region A2 of region queue A.

III. Partial overlap (the start key is different, the end key is also different, and one region contains another region):

As shown in FIG. 7, when region A1 is completely contained in region B1, region B1 is segmented at the start key and the end key of region A1, yielding C1, C2, and B1- (the remaining portion of region B1). C1 and C2 (each a sub-range of the target range) are saved as segmentation results, and B1- and the next region A2 of region queue A are taken as the two regions to be compared in step 2.

IV. Partial overlap (different start keys, different end keys, and neither region contains the other):

As shown in FIG. 8, the start key of region B1 is smaller than the start key of region A1, and the end key of region B1 is also smaller than the end key of region A1. Region A1 and region B1 are segmented using the start key of region A1 and the end key of region B1 as the cut points. The first two regions obtained, C1 and C2 (each a sub-range of the target range), are output as results, and the remaining part A1- of region A1 and the next region B2 of region queue B are taken as the two regions to be compared in step 2.

V. Partial overlap (start key is different, end key is the same):

In the example shown in FIG. 9, region A1 and region B1 are segmented using the start key of region A1 as the cut point. The two regions obtained, C1 and C2 (each a sub-range of the target range), are output as segmentation results, and the next region A2 of region queue A and the next region B2 of region queue B are then taken as the two regions to be compared in step 2.

Step 3. The client 210 sequentially reads the region in the first range and the region in the second range corresponding to the target data table of the two databases until the division is completed.

The result of dividing the regions in the first range and the regions in the second range of the target data table in the example shown in Table 1 is as shown in FIG. The target range includes 5 sub-ranges, and each sub-range is located on a single RS in both the first database and the second database, and does not span RSs.
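The segmentation of steps 1 to 3 can be illustrated with a minimal Python sketch. It assumes integer row keys and inclusive (start key, end key) pairs, so that the key after `hi` is `hi + 1`. The function name `split_regions` and the two region layouts in the usage lines are hypothetical (Table 1 is not reproduced here); the layouts were chosen only so that they produce the five sub-ranges listed in Table 2.

```python
from collections import deque

def split_regions(regions_a, regions_b):
    """Cut two ascending region queues into sub-ranges such that each
    sub-range lies inside at most one region of each queue.
    Regions are (start_key, end_key) tuples with inclusive integer keys
    (an assumption of this sketch)."""
    qa, qb = deque(regions_a), deque(regions_b)
    out = []
    while qa and qb:
        (a_start, a_end), (b_start, b_end) = qa[0], qb[0]
        # Case a): no overlap, output the region with the smaller start key.
        if a_end < b_start:
            out.append(qa.popleft())
            continue
        if b_end < a_start:
            out.append(qb.popleft())
            continue
        # Case b): overlap. Cut at the larger start key and the smaller end key.
        lo, hi = max(a_start, b_start), min(a_end, b_end)
        if a_start < lo:
            out.append((a_start, lo - 1))   # part of A before the overlap
        if b_start < lo:
            out.append((b_start, lo - 1))   # part of B before the overlap
        out.append((lo, hi))                # the overlapping part
        qa.popleft()
        qb.popleft()
        # The remainder (A1- or B1-) goes back to the head of its queue.
        if a_end > hi:
            qa.appendleft((hi + 1, a_end))
        if b_end > hi:
            qb.appendleft((hi + 1, b_end))
    out.extend(qa)
    out.extend(qb)
    return out

# Hypothetical layouts of the target data table in the two databases.
queue_a = [(1, 30), (31, 80), (81, 100)]
queue_b = [(1, 25), (26, 58), (59, 100)]
print(split_regions(queue_a, queue_b))
# [(1, 25), (26, 30), (31, 58), (59, 80), (81, 100)]
```

Cases I to V all reduce to the same cut in this sketch: the overlap is output, any part before it is output, and any part after it is compared again.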

Optionally, in S320, the client 210 may also use one of the first range and the second range as the target range. The specific manner of dividing the target range is not limited in the embodiment of the present invention.

After determining the target range, each sub-range of the above target range can be directly used as the finest granularity, and the data of the target data table in the two databases is signed by the server.

Optionally, as an embodiment, before S330 (the at least one first server signs, according to the target range, the data of the target data table in the first database to obtain a first signature) and S340 (the at least one second server signs, according to the target range, the data of the target data table in the second database to obtain a second signature), the method 300 may further include: at least one of the client, the at least one first server, and the at least one second server performs tree segmentation for each sub-range. In this case, S330 may include: the at least one first server signs, according to the tree segmentation, the segments of the data of the target data table in the first database to obtain a tree-type first signature; and S340 may include: the at least one second server signs, according to the tree segmentation, the segments of the data of the target data table in the second database to obtain a tree-type second signature. In this way, tree segmentation of the sub-ranges of the target range yields finer-grained signatures, which can improve the efficiency of comparing signatures.

The process of tree segmentation for each sub-range is described below in conjunction with a specific embodiment. In this embodiment, the tree segmentation performed for each sub-range by at least one of the client, the at least one first server, and the at least one second server includes: the at least one first server and the at least one second server collect statistics on the density of the data in the target range; and the at least one first server and the at least one second server perform tree segmentation for each sub-range according to the statistical result.

Specifically, the client 210 encapsulates the information of the sub-ranges of the segmented target range into a statistical counting request and sends it to the servers of the two databases. Because the data structure of the target data table is the same in the two databases to be compared, each sub-range only needs to be counted on a server of either one of the two databases. In one embodiment of the invention, load balancing is performed across the servers of the two databases. As shown in Table 2, the sub-range [1-25] is assigned to the second server of the second database (corresponding to RS1) to count the density, and the sub-range [26-30] is assigned to the first server of the first database (corresponding to RS1) to count the density. The sub-range [81-100] can be assigned to either the first server of the first database or the second server of the second database. In this way, no RS is idle and no RS is too busy, which balances the load of the servers.

Of course, in other embodiments of the present invention, the load balancing of the servers may be disregarded, and the client 210 may select the servers of either one of the two databases to count the data density; that is, the client 210 may select one of the two databases, and the servers of the selected database count the data density. This is not limited by the embodiment of the present invention.

Table 2. Density statistics

Sub-range of target range	First database	Second database
1-25	Wait	Count density (RS1)
26-30	Count density (RS1)	Wait
31-58	Wait	Count density (RS2)
59-80	Count density (RS2)	Wait
81-100	Wait	Count density (RS3)
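One possible way to arrive at an assignment like that of Table 2 is a greedy rule, sketched below in Python. The rule itself is an assumption and is not stated in this description: each sub-range goes to the database whose hosting RS currently has fewer counting tasks, then to the database with fewer tasks overall, with an arbitrary final tie-break in favor of the second database. The `hosts` mapping (which RS holds each sub-range in each database) is hypothetical.

```python
from collections import Counter

def assign_counting(sub_ranges, hosts):
    """Assign each sub-range to one database for density counting.
    hosts[sub_range] maps "first"/"second" to the RS holding that sub-range
    in the respective database (hypothetical input of this sketch)."""
    rs_load = Counter()   # counting tasks per (database, RS)
    db_load = Counter()   # counting tasks per database
    plan = {}
    for sr in sub_ranges:
        # min() keeps the first candidate on a tie, so "second" wins ties.
        db = min(("second", "first"),
                 key=lambda d: (rs_load[(d, hosts[sr][d])], db_load[d]))
        rs = hosts[sr][db]
        rs_load[(db, rs)] += 1
        db_load[db] += 1
        plan[sr] = (db, rs)
    return plan

# Hypothetical placement of the five sub-ranges on the RSs of each database.
hosts = {
    (1, 25):   {"first": "RS1", "second": "RS1"},
    (26, 30):  {"first": "RS1", "second": "RS2"},
    (31, 58):  {"first": "RS2", "second": "RS2"},
    (59, 80):  {"first": "RS2", "second": "RS3"},
    (81, 100): {"first": "RS3", "second": "RS3"},
}
for sub_range, target in assign_counting(list(hosts), hosts).items():
    print(sub_range, target)
```

With this hypothetical placement the plan alternates between the two databases in the same way as Table 2.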

According to Table 2, RS2 of the second database counts the density of the data in the sub-range [31-58] and segments the sub-range: the sub-range [31-58] is divided into a tree with two branches per layer, and the lowest layer of the tree (i.e., the finest segments) is [31-37][38-44][45-51][52-58]. RS2 of the second database encapsulates this information and sends it to RS2 of the first database. The format may be "start key, end key, least size, child size", and the value is "31, 58, 7, 2". After receiving the information, RS2 of the first database obtains the tree segmentation. The server of the first database corresponding to RS2 reads the data according to the tree segmentation and signs the segments of the data of the target data table in the first database to obtain the tree-type first signature.
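As an illustration, the message "31, 58, 7, 2" can be expanded back into the tree segmentation as in the following Python sketch. It assumes that "least size" is the number of keys in each finest segment and that "child size" is the number of branches per node, an interpretation inferred from the example values. The function names are hypothetical and integer row keys are assumed.

```python
def parse_descriptor(text):
    """Parse 'start key, end key, least size, child size' into four ints."""
    start, end, least_size, child_size = (int(part) for part in text.split(","))
    return start, end, least_size, child_size

def leaf_segments(start, end, least_size):
    """Finest segments: consecutive inclusive ranges of least_size keys."""
    return [(lo, min(lo + least_size - 1, end))
            for lo in range(start, end + 1, least_size)]

def tree_layers(leaves, child_size):
    """Group child_size neighbours per layer until one range remains.
    Returns the layers bottom-up (index 0 holds the finest segments)."""
    if child_size < 2:
        raise ValueError("child_size must be at least 2")
    layers = [leaves]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        layers.append([(group[0][0], group[-1][1])
                       for group in (prev[i:i + child_size]
                                     for i in range(0, len(prev), child_size))])
    return layers

start, end, least_size, child_size = parse_descriptor("31, 58, 7, 2")
for layer in tree_layers(leaf_segments(start, end, least_size), child_size):
    print(layer)
# [(31, 37), (38, 44), (45, 51), (52, 58)]
# [(31, 44), (45, 58)]
# [(31, 58)]
```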

It should be understood that, in the embodiment of the present invention, reading data is a time-consuming step. Therefore, RS2 of the second database can complete the signing while counting the density of the data in the sub-range of the target range.

The process by which the server signs the data segments according to the tree segmentation to obtain the tree-type signature can be as follows. The server performs a signature operation on each lowest-level segment of each sub-range's tree, and then builds the tree bottom-up according to the branches of the tree. FIG. 11 is a diagram showing the creation of a tree-type signature in accordance with one embodiment of the present invention.

Step a. First, establish the signatures of the finest-grained data segments. For example, v1=[31-37], v2=[38-44], v3=[45-51], v4=[52-58].

Step b. With the number of branches of the tree set to 2, establish the signatures of the upper layer. For example, v5=[31-44]=signature(v1, v2), v6=[45-58]=signature(v3, v4).

Step c. If the number of signatures in the layer is not 1, repeat step b; if the number of signatures in the layer is 1, the process ends. Finally, the signature of the top layer is obtained: v7=[31-58]=signature(v5, v6).

Optionally, the embodiment of the present invention uses a hash algorithm to sign the data. For example, the data may be signed by the Message Digest Algorithm 5 (MD5). Correspondingly, S330, in which the at least one first server signs the data of the target data table in the first database according to the target range to obtain the first signature, may include: the at least one first server signs the data of the target data table in the first database by the hash algorithm according to the target range to obtain the first signature. S340, in which the at least one second server signs the data of the target data table in the second database according to the target range to obtain the second signature, may include: the at least one second server signs the data of the target data table in the second database by the hash algorithm according to the target range to obtain the second signature.
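Steps a to c can be sketched in Python with MD5 from the standard `hashlib` module. The row representation (a dict from integer row key to string value), the `key=value;` serialization, and the function names are assumptions of this sketch. The description specifies only that the segments are signed and that upper-layer signatures are derived from the signatures below them.

```python
import hashlib

def sign_segment(rows, lo, hi):
    """Step a: MD5 signature over the rows of one finest segment, in key order.
    The 'key=value;' serialization is an assumption of this sketch."""
    digest = hashlib.md5()
    for key in range(lo, hi + 1):
        if key in rows:
            digest.update(f"{key}={rows[key]};".encode("utf-8"))
    return digest.hexdigest()

def build_signature_tree(rows, leaves, child_size=2):
    """Steps b and c: sign child_size neighbouring signatures per layer
    until a single top-layer signature remains.
    Returns the layers bottom-up as lists of ((lo, hi), signature)."""
    layer = [((lo, hi), sign_segment(rows, lo, hi)) for lo, hi in leaves]
    layers = [layer]
    while len(layer) > 1:
        upper = []
        for i in range(0, len(layer), child_size):
            group = layer[i:i + child_size]
            key_range = (group[0][0][0], group[-1][0][1])
            joined = "".join(sig for _, sig in group).encode("ascii")
            upper.append((key_range, hashlib.md5(joined).hexdigest()))
        layer = upper
        layers.append(layer)
    return layers

# Hypothetical table content for the sub-range [31-58].
rows = {key: f"value-{key}" for key in range(31, 59)}
leaves = [(31, 37), (38, 44), (45, 51), (52, 58)]
tree = build_signature_tree(rows, leaves)
print([len(layer) for layer in tree])   # [4, 2, 1]
print(tree[-1][0][0])                   # (31, 58)
```

Because both databases apply the same segmentation and the same serialization, equal data yields equal signatures at every layer, and a single changed row changes the signature of its segment and of every ancestor up to the top layer.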

After each server completes the signing, the tree-type first signature or the tree-type second signature can be fed back to the client 210. It should be understood that each sub-range in the embodiment of the present invention corresponds to one tree-type signature, so there may be multiple first signatures and multiple second signatures. Alternatively, each server may feed back to the client 210 only the highest-layer signature of the tree-type first signature or of the tree-type second signature, and send the lower-layer signatures to the client 210 for comparison only when the highest-layer signatures are inconsistent. This is not limited by the embodiment of the present invention.

Client 210 receives signatures for sub-ranges of target ranges from both databases. The client 210 compares the signatures. If the signatures of the highest layer are equal, it is considered that the contents of the target data tables in the two databases are consistent, and the comparison ends.

If the client 210 finds that the signatures of the highest layer are not equal, the signatures of the lower layers are compared in turn until the finest-grained segments with inconsistent signatures are found, which determines which data is inconsistent. Alternatively, if the client 210 finds that the signatures of the highest layer are not equal, it requests the servers to return the signatures of the next layer and continues to compare the returned signatures. If any of these signatures are found to be inconsistent, the servers are requested to return the signatures of the next layer again, until the finest-grained segments with inconsistent signatures are found.
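The layer-by-layer comparison can be sketched as follows. The sketch assumes that both signature trees of a sub-range are already available at the client as lists of layers ordered bottom-up, each layer being a list of ((start key, end key), signature) pairs. The function name and the placeholder signature strings are hypothetical.

```python
def find_mismatched_segments(tree_first, tree_second):
    """Walk two signature trees from the top layer down and return the
    finest-grained key ranges whose signatures differ.
    Layers are ordered bottom-up: index 0 holds the finest segments."""
    top = len(tree_first) - 1
    suspects = [rng for (rng, sig_a), (_, sig_b)
                in zip(tree_first[top], tree_second[top]) if sig_a != sig_b]
    for level in range(top - 1, -1, -1):
        narrowed = []
        for (rng, sig_a), (_, sig_b) in zip(tree_first[level], tree_second[level]):
            # Only descend into children of ranges already found inconsistent.
            inside = any(lo <= rng[0] and rng[1] <= hi for lo, hi in suspects)
            if inside and sig_a != sig_b:
                narrowed.append(rng)
        suspects = narrowed
    return suspects

# Placeholder signatures; only the segment [45-51] differs between the databases.
tree_first = [
    [((31, 37), "s1"), ((38, 44), "s2"), ((45, 51), "s3"), ((52, 58), "s4")],
    [((31, 44), "s5"), ((45, 58), "s6")],
    [((31, 58), "s7")],
]
tree_second = [
    [((31, 37), "s1"), ((38, 44), "s2"), ((45, 51), "x3"), ((52, 58), "s4")],
    [((31, 44), "s5"), ((45, 58), "x6")],
    [((31, 58), "x7")],
]
print(find_mismatched_segments(tree_first, tree_second))   # [(45, 51)]
```

In the variant where the servers return lower layers only on request, the same loop applies, with each iteration fetching the next layer for the suspect ranges instead of reading it from a local list.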

In summary, S350, in which the client determines, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database, may include: the client determines, according to the tree-type first signature and the tree-type second signature, whether the signatures of the same layer of the first signature tree and the second signature tree are consistent; when the signatures are inconsistent, the client determines that, for the segment corresponding to that layer, the data of the target data table in the first database differs from the data of the target data table in the second database.

The client 210 can perform a small-range query on the target data tables of the two databases according to the finest-grained segments whose signatures are inconsistent, and compare the data read in the client 210 by string comparison, so that the detailed differences between the data tables can be obtained. Alternatively, the embodiment of the present invention may omit the detailed comparison and only determine whether the data of the target data table is consistent, which is not limited by the embodiment of the present invention.
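The detailed comparison of one inconsistent segment can be illustrated with a short Python sketch. It assumes that the small-range query has already returned the rows of the segment from each database as a dict from integer row key to string value. The function name and the sample rows are hypothetical.

```python
def diff_segment(rows_first, rows_second, lo, hi):
    """Compare the rows of one segment key by key using string comparison.
    Returns (key, value_in_first, value_in_second) for each differing key;
    None marks a row that is missing on one side."""
    differences = []
    for key in range(lo, hi + 1):
        value_first = rows_first.get(key)
        value_second = rows_second.get(key)
        if value_first != value_second:
            differences.append((key, value_first, value_second))
    return differences

# Hypothetical query results for the inconsistent segment [45-51].
rows_first = {45: "a", 46: "b", 47: "c"}
rows_second = {45: "a", 46: "B", 48: "d"}
print(diff_segment(rows_first, rows_second, 45, 51))
# [(46, 'b', 'B'), (47, 'c', None), (48, None, 'd')]
```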

FIG. 12 shows a schematic block diagram of an apparatus 500 in accordance with an embodiment of the present invention, which may correspond to any of the computing devices or servers referred to in FIG. 3 of an embodiment of the present invention. As shown in FIG. 12, device 500 can include a processor 510, a memory 520, and a network interface 530. The processor 510 can be used to execute the method of the embodiment of the present invention, the memory 520 can be used to store code executed by the processor 510, and the network interface 530 is used to communicate with other devices. The computing device 310 of FIG. 3 can also include an output device or an output interface coupled to the output device for outputting a comparison result. Output devices can include displays, printers, and the like. The processor, memory and network interface in device 500 can communicate with one another via internal connection paths to communicate control and/or data signals.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the various examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the solution. A person skilled in the art can use different methods for implementing the described functions for each particular application, but such implementation should not be considered to be beyond the scope of the present invention.

A person skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the system, the device and the unit described above can refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the system embodiment described above is merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or unit, and may be in an electrical, mechanical, or other form.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

The functions may be stored in a computer readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product. Based on this understanding, the technical solution of the present invention in essence, or the portion that contributes to the prior art, or a portion of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The foregoing storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.

The above is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall be covered by the scope of the present invention. Therefore, the scope of the invention should be determined by the scope of the claims.

Claims

A method for comparing data of a data table, the method being applied to a system for comparing data of a target data table of a first database and a second database, the system comprising a client and a plurality of servers, wherein the first database corresponds to at least one first server and the second database corresponds to at least one second server, the method comprising:

The client acquires first metadata of the target data table in the first database and second metadata of the target data table in the second database, where the first metadata includes a first range corresponding to the data of the target data table in the server of the first database, and the second metadata includes a second range corresponding to the data of the target data table in the server of the second database;

Determining, by the client, a target range according to at least one of the first range and the second range;

The at least one first server signs the data of the target data table in the first database according to the target range to obtain a first signature;

And the at least one second server signs the data of the target data table in the second database according to the target range to obtain a second signature;

The client determines, according to the first signature and the second signature, whether data of the target data table in the first database is the same as data of the target data table in the second database.
The method according to claim 1, wherein each server of the first database corresponds to a first server, the first range includes a sub-range of the data of the target data table in each server of the first database, each server of the second database corresponds to a second server, and the second range includes a sub-range of the data of the target data table in each server of the second database, and wherein the determining, by the client, of the target range according to at least one of the first range and the second range includes:

Determining, by the client, the sub-ranges of the target range according to the sub-range of the data of the target data table in each server of the first database and the sub-range of the data of the target data table in each server of the second database, wherein the data corresponding to each of the sub-ranges is distributed on one server in the first database and on one server in the second database.
The method according to claim 2, wherein before the at least one first server signs the data of the target data table in the first database according to the target range to obtain the first signature and the at least one second server signs the data of the target data table in the second database according to the target range to obtain the second signature, the method further includes:

At least one of the client, the at least one first server, and the at least one second server performs tree segmentation for each of the sub-ranges;

The signing, by the at least one first server, of the data of the target data table in the first database according to the target range to obtain a first signature includes: the at least one first server signs, according to the tree segmentation, the segments of the data of the target data table in the first database to obtain the tree-type first signature;

The signing, by the at least one second server, of the data of the target data table in the second database according to the target range to obtain a second signature includes: the at least one second server signs, according to the tree segmentation, the segments of the data of the target data table in the second database to obtain the tree-type second signature.
The method according to claim 3, wherein the determining, by the client according to the first signature and the second signature, of whether the data of the target data table in the first database is the same as the data of the target data table in the second database includes:

Determining, by the client, according to the tree-type first signature and the tree-type second signature, whether the signatures of the same layer of the first signature tree and the second signature tree are consistent, and, when the signatures are inconsistent, determining that, for the segment corresponding to the layer, the data of the target data table in the first database is different from the data of the target data table in the second database.
The method according to claim 3 or 4, wherein the performing, by at least one of the client, the at least one first server, and the at least one second server, of tree segmentation for each of the sub-ranges includes:

The at least one first server and the at least one second server perform statistics on density of data in the target range;

The at least one first server and the at least one second server perform tree segmentation for each of the sub-ranges according to a statistical result.
The method according to any one of claims 1 to 5, wherein the signing, by the at least one first server, of the data of the target data table in the first database according to the target range to obtain the first signature includes: the at least one first server signs the data of the target data table in the first database by using a hash algorithm according to the target range to obtain the first signature;

The signing, by the at least one second server, of the data of the target data table in the second database according to the target range to obtain the second signature includes: the at least one second server signs the data of the target data table in the second database by using the hash algorithm according to the target range to obtain the second signature.
A system for comparing data of a data table, wherein the system is configured to compare data of a target data table of a first database and a second database, the system comprising a computing device running a client and a plurality of servers running server sides, wherein the first database includes at least one first server running a first server side, and the second database includes at least one second server running a second server side:

The computing device is configured to acquire first metadata of the target data table in the first database and second metadata of the target data table in the second database, where the first metadata includes a first range corresponding to the data of the target data table in the server of the first database, and the second metadata includes a second range corresponding to the data of the target data table in the server of the second database;

The computing device is further configured to determine a target range according to at least one of the first range and the second range;

The at least one first server is configured to sign data of the target data table in the first database according to the target range to obtain a first signature;

The at least one second server is configured to sign data of the target data table in the second database according to the target range to obtain a second signature;

The computing device is further configured to determine, according to the first signature and the second signature, whether data of the target data table in the first database is the same as data of the target data table in the second database.
The system according to claim 7, wherein each server in the first database for storing the target data table is a first server running the first server side, the first range includes a sub-range of the data of the target data table in each of the first servers of the first database, each server in the second database for storing the target data table is a second server running the second server side, and the second range includes a sub-range of the data of the target data table in each of the second servers of the second database, wherein the computing device is specifically configured to:

Determine the sub-ranges of the target range according to the sub-range of the data of the target data table in each of the first servers of the first database and the sub-range of the data of the target data table in each of the second servers of the second database, wherein the data corresponding to each of the sub-ranges is distributed on one server in the first database and on one server in the second database.
The system according to claim 8, wherein before the first server signs the data of the target data table in the first database according to the target range to obtain the first signature and the second server signs the data of the target data table in the second database according to the target range to obtain the second signature,

At least one of the computing device, the at least one first server, and the at least one second server is configured to perform tree segmentation for each of the sub-ranges;

The at least one first server is specifically configured to: according to the tree segment, sign a segment of data of the target data table in the first database to obtain the first signature of the tree type;

The at least one second server is specifically configured to: according to the tree segment, sign a segment of data of the target data table in the second database to obtain the second signature of the tree.
The system of claim 9 wherein said computing device is specifically configured to:

Determine, according to the tree-type first signature and the tree-type second signature, whether the signatures of the same layer of the first signature tree and the second signature tree are consistent, and, when the signatures are inconsistent, determine that, for the segment corresponding to the layer, the data of the target data table in the first database is different from the data of the target data table in the second database.
A system according to claim 9 or 10, characterized in that

The at least one first server and the at least one second server are configured to perform statistics on density of data in the target range;

The at least one first server and the at least one second server are configured to perform tree segmentation for each of the sub-ranges according to a statistical result.
The system according to any one of claims 7 to 11, wherein the at least one first server is specifically configured to: sign the data of the target data table in the first database by using a hash algorithm according to the target range to obtain the first signature;

The at least one second server is specifically configured to: according to the target range, sign the data of the target data table in the second database by using a hash algorithm to obtain a second signature.