WO2018121025A1 - Method and system for comparing data of data table - Google Patents

Method and system for comparing data of data table

Info

Publication number
WO2018121025A1
WO2018121025A1 PCT/CN2017/108196 CN2017108196W WO2018121025A1 WO 2018121025 A1 WO2018121025 A1 WO 2018121025A1 CN 2017108196 W CN2017108196 W CN 2017108196W WO 2018121025 A1 WO2018121025 A1 WO 2018121025A1
Authority
WO
WIPO (PCT)
Prior art keywords
server
database
data
target
range
Prior art date
Application number
PCT/CN2017/108196
Other languages
French (fr)
Chinese (zh)
Inventor
崔鑫
杨磊
蔺若林
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2018121025A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/12 Applying verification of the received information
    • H04L63/126 Applying verification of the received information the source of the received data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L9/0861 Generation of secret information including derivation or calculation of cryptographic keys or passwords
    • H04L9/0866 Generation of secret information including derivation or calculation of cryptographic keys or passwords involving user or device identifiers, e.g. serial number, physical or biometrical information, DNA, hand-signature or measurable physical characteristics

Definitions

  • the present application relates to the field of databases and, more particularly, to a method and system for comparing data of a data table.
  • a key-value database is well suited to workloads with a large number of random writes and random reads. All data in a key-value database exists in key-value form.
  • the key-value form has a strictly defined structure, and all data in the database is stored in the underlying file system as immutable files. When new data is written, a new key-value is generated; when old data is rewritten or deleted, a new key-value is also generated to mark the rewrite or delete.
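  • The write behaviour described above can be illustrated with a minimal sketch. It is not taken from the application: the class name `AppendOnlyKV`, the `TOMBSTONE` marker, and the in-memory list standing in for the underlying files are all hypothetical, chosen only to show that rewrites and deletes append new key-values rather than modifying old ones.

```python
from typing import List, Optional, Tuple

# Hypothetical delete marker; a real store would use its own marker type.
TOMBSTONE = object()


class AppendOnlyKV:
    """Toy key-value store in which existing entries are never modified."""

    def __init__(self) -> None:
        # Each entry is (key, version, value); the list is only ever appended to.
        self.log: List[Tuple[str, int, object]] = []
        self._version = 0

    def put(self, key: str, value: str) -> None:
        # A write or rewrite always produces a new key-value entry.
        self._version += 1
        self.log.append((key, self._version, value))

    def delete(self, key: str) -> None:
        # A delete is also a new entry, marking the key as removed.
        self._version += 1
        self.log.append((key, self._version, TOMBSTONE))

    def get(self, key: str) -> Optional[str]:
        # The newest entry for a key decides what a reader sees.
        for entry_key, _, value in reversed(self.log):
            if entry_key == key:
                return None if value is TOMBSTONE else value  # type: ignore[return-value]
        return None
```

  • In this sketch a rewrite followed by a delete leaves three entries in the log, which is why comparing two copies of a table means comparing what readers see rather than the raw files.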
  • in the big data field, multi-data-center solutions usually provide off-site data backup. Verifying the consistency of data before, during, and after backup has therefore become an important feature of big data storage.
  • existing comparison tools are data-based comparison tools.
  • when a comparison tool compares the data of two databases (a working database and a backup database, whose data tables should have the same structure), it parallelizes the verification task. For example, it submits a MapReduce (MR) job that is distributed to many nodes for parallel execution. The comparison tool reads data from the data tables of the two databases and compares the data to find inconsistencies.
  • because the existing comparison tool compares the data in the data table row by row, the comparison efficiency is low and the comparison tool runs slowly.
  • the existing comparison technology requires the mapping framework to communicate with multiple servers of the local database cluster, and possibly also with servers of the remote database cluster, which consumes a large amount of network resources.
  • the present application provides a method and system for comparing data of a data table, which avoids transmitting and comparing large amounts of data, runs fast, has a low cost, and consumes few network resources.
  • a first aspect of the present application provides a method of comparing data of a data table. The method is applied to a system for comparing data of a target data table in a first database and a second database. The system comprises a client and a plurality of servers, wherein the first database corresponds to at least one first server and the second database corresponds to at least one second server. The method comprises: the client acquires first metadata of the target data table in the first database and second metadata of the target data table in the second database, wherein the first metadata includes a first range corresponding to the data of the target data table on a server of the first database, and the second metadata includes a second range corresponding to the data of the target data table on a server of the second database; the client determines a target range according to at least one of the first range and the second range; the at least one first server signs the data of the target data table in the first database according to the target range to obtain a first signature; the at least one second server signs the data of the target data table in the second database according to the target range to obtain a second signature; and the client determines, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.
  • the client determines the target range according to the distribution of the data of the data table, the servers sign the data according to the target range, and the client checks whether the signatures corresponding to the data of the data table in the two databases are consistent. In this way, whether the data of the two data tables is consistent can be determined without transmitting and comparing large amounts of data, so the method runs fast, has a low cost, and consumes few network resources.
  • in a possible implementation, each server of the first database corresponds to one first server, and the first range includes the sub-range of the data of the target data table on each server of the first database; each server of the second database corresponds to one second server, and the second range includes the sub-range of the data of the target data table on each server of the second database. The client determining the target range according to at least one of the first range and the second range includes: the client determines the sub-ranges of the target range according to the sub-range of the data of the target data table on each server of the first database and the sub-range of the data of the target data table on each server of the second database, where the data corresponding to each sub-range is distributed on one server in the first database and on one server in the second database.
  • in this way, data transmission across servers (cross-RS) is not required when the data is signed.
  • in a possible implementation, before the at least one first server signs the data of the target data table in the first database according to the target range to obtain the first signature, and before the at least one second server signs the data of the target data table in the second database according to the target range to obtain the second signature, the method further includes: at least one of the client, the at least one first server, and the at least one second server performs tree segmentation on each of the sub-ranges. The at least one first server signing the data of the target data table in the first database according to the target range to obtain the first signature includes: the at least one first server signs the segments of the data of the target data table in the first database according to the tree segmentation to obtain a tree-type first signature. The at least one second server signing the data of the target data table in the second database according to the target range to obtain the second signature includes: the at least one second server signs the segments of the data of the target data table in the second database according to the tree segmentation to obtain a tree-type second signature.
  • the client determining, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database includes: the client determines, according to the tree-type first signature and the tree-type second signature, whether the signatures at the same layer of the first-signature tree and the second-signature tree are consistent. When the signatures are inconsistent, the client determines that, in the segment corresponding to that layer, the data of the target data table in the first database differs from the data of the target data table in the second database.
  • in a possible implementation, at least one of the client, the at least one first server, and the at least one second server performing tree segmentation on each of the sub-ranges includes: the at least one first server and the at least one second server collect statistics on the density of the data in the target range; and the at least one first server and the at least one second server perform tree segmentation on each of the sub-ranges according to the statistical results.
  • in a possible implementation, the at least one first server signing the data of the target data table in the first database according to the target range to obtain the first signature includes: the at least one first server signs the data of the target data table in the first database by using a hash algorithm according to the target range to obtain the first signature. The at least one second server signing the data of the target data table in the second database according to the target range to obtain the second signature includes: the at least one second server signs the data of the target data table in the second database by using a hash algorithm according to the target range to obtain the second signature.
  • a second aspect of the present application provides a system for comparing data of a data table. The system is configured to compare data of a target data table in a first database and a second database, and includes a computing device running a client and a plurality of servers running server sides, wherein the first database comprises at least one first server running a first server side, and the second database comprises at least one second server running a second server side. The computing device is configured to acquire first metadata of the target data table in the first database and second metadata of the target data table in the second database, where the first metadata includes a first range corresponding to the data of the target data table on a server of the first database, and the second metadata includes a second range corresponding to the data of the target data table on a server of the second database; the computing device is further configured to determine a target range according to at least one of the first range and the second range; the at least one first server is configured to sign the data of the target data table in the first database according to the target range to obtain a first signature; the at least one second server is configured to sign the data of the target data table in the second database according to the target range to obtain a second signature; and the computing device is further configured to determine, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.
  • in a possible implementation, each server in the first database that stores the target data table is a first server running a first server side, and the first range includes the sub-range of the data of the target data table on each first server of the first database; each server in the second database that stores the target data table is a second server running a second server side, and the second range includes the sub-range of the data of the target data table on each second server of the second database. The computing device is specifically configured to determine the sub-ranges of the target range according to the sub-range of the data of the target data table on each first server of the first database and the sub-range of the data of the target data table on each second server of the second database, where the data corresponding to each sub-range is distributed on one server in the first database and on one server in the second database.
  • in a possible implementation, before the first server signs the data of the target data table in the first database according to the target range to obtain the first signature, and before the second server signs the data of the target data table in the second database according to the target range to obtain the second signature, at least one of the computing device, the at least one first server, and the at least one second server is configured to perform tree segmentation on each of the sub-ranges. The at least one first server is specifically configured to sign the segments of the data of the target data table in the first database according to the tree segmentation to obtain a tree-type first signature. The at least one second server is specifically configured to sign the segments of the data of the target data table in the second database according to the tree segmentation to obtain a tree-type second signature.
  • the computing device is specifically configured to: determine, according to the tree-type first signature and the tree-type second signature, whether the signatures at the same layer of the first-signature tree and the second-signature tree are consistent; and when the signatures are inconsistent, determine that the data of the target data table in the first database differs from the data of the target data table in the second database.
  • the at least one first server and the at least one second server are configured to collect statistics on the density of the data in the target range, and to perform tree segmentation on each of the sub-ranges according to the statistical result.
  • the at least one first server is specifically configured to sign the data of the target data table in the first database by using a hash algorithm according to the target range to obtain the first signature; the at least one second server is specifically configured to sign the data of the target data table in the second database by using a hash algorithm according to the target range to obtain the second signature.
  • a third aspect of the present application provides a storage medium in which a program is stored. When the program is run by a computing device and a server, the computing device and the server perform the method of the foregoing first aspect or any implementation of the first aspect.
  • the storage medium includes, but is not limited to, a read only memory, a random access memory, a flash memory, an HDD, or an SSD.
  • a fourth aspect of the present application provides a computer program product comprising program instructions. When the computer program product is executed by a computing device and a server, the computing device and the server perform the method of comparing data of a data table provided by the foregoing first aspect or any implementation of the first aspect.
  • the computer program product can be a software installation package. When the method of comparing data of a data table provided by the foregoing first aspect or any implementation of the first aspect is needed, the computer program product can be downloaded and executed on the computing device and the server.
  • FIG. 1 is a schematic diagram of a method of comparing data of a data table using a comparison tool.
  • FIG. 2 is a schematic block diagram of a system for comparing data of a data table in accordance with one embodiment of the present invention.
  • FIG. 3 is a schematic block diagram of a system for comparing data of a data table in accordance with another embodiment of the present invention.
  • FIG. 4 is a schematic flow chart of a method of comparing data of a data table according to an embodiment of the present invention.
  • Figure 5 is a schematic illustration of the segmentation target range of one embodiment of the present invention.
  • Figure 6 is a schematic illustration of a segmentation target range in accordance with another embodiment of the present invention.
  • Figure 7 is a schematic illustration of a segmentation target range in accordance with another embodiment of the present invention.
  • Figure 8 is a schematic illustration of a segmentation target range in accordance with another embodiment of the present invention.
  • Figure 9 is a schematic illustration of a segmentation target range in accordance with another embodiment of the present invention.
  • Figure 10 is a schematic illustration of the results of the segmentation of the target range of one embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a tree-type signature in accordance with an embodiment of the present invention.
  • Figure 12 is a schematic block diagram of a computing device or server in accordance with one embodiment of the present invention.
  • the existing comparison tool is a data-based comparison tool.
  • the comparison tool parallelizes the verification tasks.
  • the following describes how an existing comparison tool compares the data of data tables in a database, taking the Hadoop database (HBase) as an example.
  • FIG. 1 is a schematic diagram of a method 100 in which an existing comparison tool compares data of a data table.
  • the method 100 includes:
  • the existing comparison tool submits an MR job to the Hbase cluster corresponding to the database of the data center (DC) 1.
  • the ResourceManager (RM) of the HBase cluster distributes the MR job to many nodes for parallel execution, that is, it splits the MR job into multiple map tasks.
  • each map task is responsible for comparing a part of the data.
  • Each map task reads data from the HBase clusters of the two data centers DC1 and DC2, then compares the data and prints inconsistent data.
  • each server in the HBase cluster is configured with a RegionServer (RS), which manages the tasks running on that server.
  • because the existing comparison tool compares the data in the data table row by row, the comparison efficiency is low and the comparison tool runs slowly.
  • the existing comparison tool not only requires the participation of two HBase clusters but also requires a cluster to provide the nodes on which the MR jobs run, so the comparison tool is costly to deploy and operate.
  • the existing comparison technology requires the mapping framework to communicate with the RSs of multiple servers of the HBase cluster of the local database, and possibly also with the RSs of servers of the HBase cluster of the remote database, which consumes a large amount of network resources.
  • an embodiment of the present invention provides a method for comparing data of a data table.
  • FIG. 2 shows a schematic block diagram of a system 200 for comparing data of a data table in accordance with an embodiment of the present invention.
  • the system 200 illustrated in FIG. 2 is shown from a software perspective.
  • from a software perspective, the system 200 includes a client 210 and a plurality of server sides, wherein each database corresponds to at least one server side: the first database corresponds to at least one first server side 221, and the second database corresponds to at least one second server side 222.
  • FIG. 3 shows a schematic block diagram of a system 300 for comparing data of a data table in accordance with an embodiment of the present invention.
  • from a hardware perspective, the system 300 includes a computing device 310 running a client and a plurality of servers running server sides.
  • the client 210 can be deployed on the user's computing device 310.
  • the computing device 310 is usually not a server of any database, that is, it is usually not a server in a DC.
  • the first server side 221 can be deployed on the first server 321 of the first DC corresponding to the first database, and the second server side 222 can be deployed on the second server 322 of the second DC corresponding to the second database.
  • a first server side 221 may be deployed on each server of the first database that stores the data table, in which case each server on which a first server side 221 is deployed is regarded as a first server 321; likewise, a second server side 222 may be deployed on each server of the second database that stores the data table, in which case each server on which a second server side 222 is deployed is regarded as a second server 322.
  • alternatively, a plurality of servers in each database may share one server side, which is not limited in this embodiment of the present invention.
  • the number of first server sides and second server sides shown in FIG. 2 and the number of first servers and second servers shown in FIG. 3 are only schematic and are not intended to limit the embodiments of the present invention.
  • the embodiments of the present invention need to acquire metadata. Metadata is generally stored in a meta table, and the meta table is usually stored on a server of the database other than the servers that store the data table.
  • FIG. 3 schematically shows the meta table of the first database stored on the third server 323 of the first database, and the meta table of the second database stored on the fourth server 324 of the second database.
  • the meta table can also be stored on a server that stores the data table in the database, which is not limited in this embodiment of the present invention.
  • a server that stores the data table (for example, the first server and the second server) may be regarded as a storage node on which a server side is deployed; the server side may be part of the function of the RS or may exist independently of the RS.
  • a server that stores the meta table can be regarded as a metadata management node.
  • the server side of the embodiments of the present invention may be a function module of the RS, or may be a separate module or unit, which is not limited by the embodiment of the present invention.
  • method 400 includes:
  • the client 210 acquires the first metadata of the target data table in the first database and the second metadata of the target data table in the second database, where the first metadata includes a first range corresponding to the data of the target data table on a server of the first database, and the second metadata includes a second range corresponding to the data of the target data table on a server of the second database.
  • the client 210 determines a target range according to at least one of the first range and the second range.
  • the at least one first server side 221 signs the data of the target data table in the first database according to the target range to obtain a first signature.
  • the at least one second server side 222 signs the data of the target data table in the second database according to the target range to obtain a second signature.
  • the client 210 determines, according to the first signature and the second signature, whether data of the target data table in the first database is the same as data of the target data table in the second database.
  • the client determines the target range according to the data distribution of the data table, the server sides sign the data according to the target range, and the client checks whether the signatures corresponding to the data of the data tables in the two databases are consistent. In this way, whether the data of the two data tables is consistent can be determined without transmitting and comparing large amounts of data, so the method runs fast, has a low cost, and consumes few network resources.
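  • The signing and comparison steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the application: each table is modelled as an in-memory dict from integer row key to string value, SHA-256 from Python's `hashlib` stands in for the unspecified hash algorithm, and the function names `sign_range` and `differing_ranges` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (start key, end key); an assumed encoding


def sign_range(table: Dict[int, str], key_range: Range) -> str:
    """Server-side step (sketch): hash all rows whose key lies in key_range."""
    start, end = key_range
    digest = hashlib.sha256()
    for key in sorted(k for k in table if start <= k <= end):
        # Length-prefix every field so that field boundaries are unambiguous.
        for field in (str(key), table[key]):
            data = field.encode("utf-8")
            digest.update(len(data).to_bytes(4, "big"))
            digest.update(data)
    return digest.hexdigest()


def differing_ranges(
    first_table: Dict[int, str],
    second_table: Dict[int, str],
    target_range: List[Range],
) -> List[Range]:
    """Client-side step (sketch): compare only the signatures, not the rows."""
    return [
        sub_range
        for sub_range in target_range
        if sign_range(first_table, sub_range) != sign_range(second_table, sub_range)
    ]
```

  • In a real deployment `sign_range` would run next to the data on each database's servers, so only one short digest per sub-range would travel to the client.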
  • the first database and the second database in which the target data tables to be compared are located are different databases, and the two databases may further belong to server clusters of different data centers.
  • the two databases may belong to the same data center, which is not limited by the embodiment of the present invention.
  • the data table in the database is large, and it is generally required to divide the data table horizontally and store it on multiple servers to enhance the speed of concurrent processing.
  • the client 210 communicates with the servers of the first database and of the second database that store the metadata of the target data table, respectively, to obtain the first metadata of the target data table in the first database and the second metadata of the target data table in the second database.
  • Metadata is generally stored in a meta table.
  • the meta table is usually stored on a server of the database other than the servers that store the data table.
  • the meta table can also be stored on a server of the database that stores the data table, which is not limited in this embodiment of the present invention.
  • the client 210 obtains the two meta tables corresponding to the target data tables of the two databases, that is, obtains the first metadata and the second metadata. It is assumed that each database includes three servers, one RS runs on each server, and each RS corresponds to one region in which the target data table is stored. From the first metadata and the second metadata, the range corresponding to each region is obtained, that is, its start key and end key. The first metadata includes the first range corresponding to the data of the target data table on the servers of the first database, and the second metadata includes the second range corresponding to the data of the target data table on the servers of the second database. In a specific example, the distribution of the target data table table1 can be as shown in Table 1.
  • the target data table of the first database has a key range of 1-30 on RS1 of the first database, a key range of 31-80 on RS2 of the first database, and a key range of 81-100 on RS3 of the first database.
  • the target data table of the second database has a key range of 1-25 on RS1 of the second database, a key range of 26-60 on RS2 of the second database, and a key range of 61-100 on RS3 of the second database.
  • the client 210 determines the target range according to at least one of the first range and the second range.
  • each server of the first database corresponds to one first server side 221, and the first range includes the sub-range of the data of the target data table on each server of the first database; each server of the second database corresponds to one second server side 222, and the second range includes the sub-range of the data of the target data table on each server of the second database.
  • in this case, the client 210 determining the target range according to at least one of the first range and the second range may include: the client 210 determines the sub-ranges of the target range according to the sub-range of the data of the target data table on each server of the first database and the sub-range of the data of the target data table on each server of the second database, where the data corresponding to each sub-range is distributed on one server in the first database and on one server in the second database.
  • the client 210 may segment the first range and the second range (that is, the distributions of start keys and end keys) corresponding to the two data tables with the goal of maximally matching their overlapping ranges, to obtain the target range.
  • the target range includes a plurality of sub-ranges, and the data corresponding to each sub-range is distributed on one server in the first database and distributed on one server in the second database. In this way, when the data is signed, the data transmission between the servers (cross-RS) is no longer required, which can further improve the running speed and reduce the occupation of network resources.
  • a scheme for dividing the target range into sub-ranges is described in detail below. The scheme ensures that each sub-range of the target range is distributed on one server in the first database and on one server in the second database, and also that the number of sub-ranges is as small as possible.
  • the specific steps of the segmentation can be as follows.
  • Step 1: The client 210 forms two region queues from the distribution of the target data tables of the two databases on the servers, in descending order according to row keys.
  • the first range corresponds to the region queue A (A1, A2, ...)
  • the second range corresponds to the region queue B (B1, B2, ...).
  • the client 210 sequentially selects regions from the two region queues.
  • Step 2: The client 210 compares the ranges of the two selected regions (for example, Ax and By) to see whether the two regions overlap.
  • when the two regions do not overlap, the region with the smaller start key is output as an already segmented region (that is, a sub-range of the target range); the next region is then taken from the region queue in which the region with the smaller start key is located, and step 2 is repeated to continue the comparison.
  • when the two regions cover the same range, either of the regions is output as the already segmented region C1 (that is, a sub-range of the target range); the next region is then taken from each of the two region queues, and step 2 is repeated to continue the comparison.
  • region B1 is segmented by the start key and the end key of region A1 to obtain C1, C2, and B1- (the remaining portion of region B1). C1 and C2 are saved as segmentation results, and B1- and the next region A2 of region queue A are taken as the two regions to be compared in step 2.
  • the start key of region B1 is smaller than the start key of region A1, and the end key of region B1 is also smaller than the end key of region A1. In this case, the start key of region A1 and the end key of region B1 are used as the segmentation criteria to segment region A1 and region B1. The first two regions obtained, C1 and C2 (each a sub-range of the target range), are output as results, and the remaining portion of region A1 and the next region B2 of region queue B are taken as the two regions to be compared in step 2.
  • the start key of region A1 is used as the segmentation criterion to segment region A1 and region B1. The two regions obtained, C1 and C2, are output as segmentation results, and the next region A2 of region queue A and the next region B2 of region queue B are then taken as the two regions to be compared in step 2.
  • Step 3: The client 210 sequentially reads the regions in the first range and the regions in the second range corresponding to the target data tables of the two databases, repeating step 2 until the segmentation is complete.
  • the target range includes 5 sub-ranges, and each sub-range is distributed on one RS whether in the first database or the second database, and does not cross the RS.
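  • The segmentation in steps 1 to 3 can be sketched as a two-queue sweep. The sketch makes several assumptions that are not in the original: keys are integers, ranges are inclusive `(start key, end key)` tuples, both queues are sorted by ascending start key, and the function name `split_target_range` is hypothetical. It merges the individual cases of step 2 into one loop rather than reproducing them literally.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # inclusive (start key, end key); an assumed encoding


def split_target_range(queue_a: List[Range], queue_b: List[Range]) -> List[Range]:
    """Cut two region queues into sub-ranges that never cross a region boundary."""
    it_a, it_b = iter(queue_a), iter(queue_b)
    a: Optional[Range] = next(it_a, None)
    b: Optional[Range] = next(it_b, None)
    result: List[Range] = []
    while a is not None or b is not None:
        if b is None or (a is not None and a[1] < b[0]):
            # No overlap and A starts first (or B is exhausted): output A as is.
            result.append(a)
            a = next(it_a, None)
        elif a is None or b[1] < a[0]:
            # No overlap and B starts first (or A is exhausted): output B as is.
            result.append(b)
            b = next(it_b, None)
        elif a[0] < b[0]:
            # Overlap, A starts earlier: cut off the part of A before B starts.
            result.append((a[0], b[0] - 1))
            a = (b[0], a[1])
        elif b[0] < a[0]:
            # Overlap, B starts earlier: cut off the part of B before A starts.
            result.append((b[0], a[0] - 1))
            b = (a[0], b[1])
        else:
            # Same start key: output the common part up to the smaller end key,
            # keep the remainder of the longer region, advance the finished one.
            end = min(a[1], b[1])
            result.append((a[0], end))
            a = (end + 1, a[1]) if a[1] > end else next(it_a, None)
            b = (end + 1, b[1]) if b[1] > end else next(it_b, None)
    return result
```

  • Applied to the key ranges given for table1 (1-30, 31-80, 81-100 in the first database and 1-25, 26-60, 61-100 in the second), this sketch yields the five sub-ranges 1-25, 26-30, 31-60, 61-80, and 81-100.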
  • the client 210 may also use one of the first range and the second range as the target range.
  • the specific manner of dividing the target range is not limited in the embodiment of the present invention.
  • each sub-range of the above target range can be directly used as the finest granularity, and the data of the target data table in the two databases is signed by the server.
  • before S330 (the at least one first server side signs the data of the target data table in the first database according to the target range to obtain the first signature) and S340 (the at least one second server side signs the data of the target data table in the second database according to the target range to obtain the second signature), the method 300 may further include: at least one of the client, the at least one first server side, and the at least one second server side performs tree segmentation on each sub-range. In that case, S330 may include: the at least one first server side signs the segments of the data of the target data table in the first database according to the tree segmentation to obtain a tree-type first signature; and S340 may include: the at least one second server side signs the segments of the data of the target data table in the second database according to the tree segmentation to obtain a tree-type second signature.
  • That at least one of the client, the at least one first server, and the at least one second server performs tree segmentation for each sub-range may include: the at least one first server and the at least one second server collect statistics on the density of the data in the target range, and perform tree segmentation for each sub-range according to the statistical result.
  • The client 210 encapsulates the information of the sub-ranges of the divided target range into statistical-count requests and sends them to the servers of the two databases. Because the target data table has the same data structure in the two databases being compared, each sub-range only needs to be counted on a server of either one of the two databases.
  • A load balancing operation is performed on the servers of the two databases. As shown in Table 2, the sub-range [0-25] is assigned to the second server of the second database (corresponding to RS1) to count the density, and the sub-range [26-30] is assigned to the first server of the first database (corresponding to RS1) to count the density.
  • The sub-range [81-100] can be assigned to either the first server of the first database or the second server of the second database. In this way, no RS is idle and no RS is overloaded, which balances the load across the servers.
  • Alternatively, load balancing across the servers may be disregarded: the client 210 may have the servers of either one of the two databases count the data density, or the client 210 may select one of the two databases and have the servers of the selected database count the data density.
  • Table 2 shows the density statistics
  • RS2 of the second database counts the density of the data in the sub-range [31-58] and segments the sub-range, dividing [31-58] into a tree with two branches per layer. The lowest layer of the tree (that is, the finest segments) is [31-37], [38-44], [45-51], [52-58].
  • RS2 of the second database encapsulates this information and sends it to RS2 of the first database.
  • The format may be "start key, end key, least size, child size", and the value here is "31, 58, 7, 2".
  • RS2 of the first database thereby obtains the tree segmentation information.
  • The first server of the first database (corresponding to RS2) reads the data according to the tree segmentation and signs the segments of the data of the target data table in the first database to obtain the tree-type first signature.
  • Reading data is a time-consuming step. Therefore, RS2 of the second database can complete the signature while counting the density of the data in the sub-range of the target range.
  • The process by which a server signs the data segments according to the tree segmentation to obtain the tree-type signature can be as follows.
  • The server performs a signature operation on each lowest-level segment of each sub-range's tree, and then builds the tree bottom-up according to the branches of the tree.
  • Figure 11 is a diagram showing the creation of a tree-type signature in accordance with one embodiment of the present invention.
  • The embodiment of the present invention uses a hash algorithm to sign the data.
  • The data may be signed with Message Digest Algorithm 5 (MD5).
  • S330, in which at least one first server signs the data of the target data table in the first database according to the target range to obtain the first signature, may include: the at least one first server signs, according to the target range, the data of the target data table in the first database by using the hash algorithm to obtain the first signature.
  • S340, in which at least one second server signs the data of the target data table in the second database according to the target range to obtain the second signature, may include: the at least one second server signs, according to the target range, the data of the target data table in the second database by using the hash algorithm to obtain the second signature.
  • After each server finishes signing, it can feed back the tree-type first signature or the tree-type second signature to the client 210.
  • Each sub-range in the embodiment of the present invention corresponds to one tree-type signature, so there may be multiple first signatures and multiple second signatures.
  • Each server may also feed back to the client 210 only the highest-layer signature of the tree-type first signature or of the tree-type second signature.
  • When the highest-layer signatures are inconsistent, the lower-layer signatures are then sent to the client 210 for comparison. This is not limited by the embodiment of the present invention.
  • The client 210 receives the signatures for the sub-ranges of the target range from both databases and compares them. If the highest-layer signatures are equal, the contents of the target data table in the two databases are considered consistent, and the comparison ends.
  • If the client 210 finds that the highest-layer signatures are not equal, it compares the lower-layer signatures in turn until it finds the finest-grained segments whose signatures are inconsistent, thereby determining which data is inconsistent. Alternatively, if the client 210 finds that the highest-layer signatures are not equal, it requests the servers to return the signatures of the next layer and continues comparing the returned signatures. If any of those signatures are inconsistent, it requests the servers to return the signatures of the layer below, and so on until the finest-grained segments with inconsistent signatures are found.
  • S350, in which the client determines, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database, may include: the client determines, according to the tree-type first signature and the tree-type second signature, whether the signatures at the same layer of the two trees are consistent; when the signatures are inconsistent, it determines that, for the segment corresponding to that layer, the data of the target data table in the first database differs from the data of the target data table in the second database.
  • The client 210 can then run small-range queries against the target data table of the two databases for the finest-grained segments whose signatures are inconsistent, and compare the data read back as strings in the client 210 to obtain the detailed differences between the data tables.
  • Alternatively, the embodiment of the present invention may skip the detailed comparison and only determine whether the data of the target data table is consistent; this is not limited by the embodiment of the present invention.
  • FIG. 12 shows a schematic block diagram of a device 500 in accordance with an embodiment of the present invention, which may correspond to any of the computing devices or servers referred to in FIG. 3.
  • The device 500 can include a processor 510, a memory 520, and a network interface 530.
  • The processor 510 can be used to execute the method of the embodiment of the present invention.
  • The memory 520 can be used to store code executed by the processor 510.
  • The network interface 530 is used to communicate with other devices.
  • The computing device 310 of FIG. 3 can also include an output device, or an output interface coupled to an output device, for outputting a comparison result.
  • Output devices can include displays, printers, and the like.
  • The processor, memory, and network interface in the device 500 can communicate with one another via internal connection paths to transfer control and/or data signals.
  • The disclosed systems, devices, and methods may be implemented in other manners.
  • The system embodiment described above is merely illustrative.
  • The division into units is only a division by logical function; in actual implementation there may be other ways of dividing. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • The mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or unit, and may be in electrical, mechanical, or other form.
  • The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • Each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • The functions may be stored in a computer-readable storage medium if implemented in the form of a software functional unit and sold or used as a standalone product.
  • The technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • The foregoing storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.

Abstract

Provided in the present application are a method and a system for comparing data of a data table. The system comprises a client and a plurality of servers. A first database corresponds to at least one first server, and a second database corresponds to at least one second server. The client acquires first metadata and second metadata of a target data table in the two databases, the first metadata comprising a first range corresponding to the data of the target data table, the second metadata comprising a second range corresponding to the data of the target data table. The client determines a target range according to at least one of the first range and the second range. The first server signs, according to the target range, the data of the target data table in the first database to obtain a first signature; similarly, the second server obtains a second signature. The client determines, according to the first signature and the second signature, whether the data of the target data table in the two databases is identical, avoiding excessive data transmission and comparison. The invention has the advantages of high operation speed, low cost and low network resources occupancy.

Description

Method and system for comparing data of a data table
Technical field
The present application relates to the field of databases and, more particularly, to a method and system for comparing data of a data table.
Background
In the big data field, a key-value database is one of the best choices for handling large volumes of random writes and random reads. All data in a key-value database exists in key-value form. The key-value form has a strictly defined structure, and all data in the database is stored in the underlying file system as files that cannot be rewritten. Writing new data generates a new key-value; rewriting or deleting old data also generates a new key-value that marks the rewrite or deletion.
In addition, to achieve higher data availability and better disaster tolerance, the big data field usually backs up data off-site in multi-data-center solutions. Verifying data consistency before, during, and after backup has therefore become an important feature in big data storage.
Existing comparison tools are data-based. When such a tool compares the contents of a data table in two databases (a working database and a backup database; the structure of the data table should be the same in both), it parallelizes the verification task, for example by submitting it as MapReduce (MR) jobs that are distributed to many nodes and executed in parallel. The tool reads data from the data tables of the two databases, compares it, and obtains the inconsistent data.
Existing comparison tools compare the data in the data tables row by row, so the comparison is inefficient and the tools run slowly. In addition, the existing comparison technique requires the mapping framework to communicate locally with multiple servers of the local database's cluster, and possibly also with servers of the remote database's cluster, which consumes a large amount of network resources.
Summary of the invention
The present application provides a method and system for comparing data of a data table that avoid transferring and comparing large amounts of data, run fast at low cost, and occupy few network resources.
A first aspect of the present application provides a method for comparing data of a data table. The method is applied to a system for comparing data of a target data table in a first database and a second database. The system includes a client and a plurality of servers, where the first database corresponds to at least one first server and the second database corresponds to at least one second server. The method includes: the client obtains first metadata of the target data table in the first database and second metadata of the target data table in the second database, where the first metadata includes a first range to which the data of the target data table corresponds on the server machines of the first database, and the second metadata includes a second range to which the data of the target data table corresponds on the server machines of the second database; the client determines a target range according to at least one of the first range and the second range; the at least one first server signs, according to the target range, the data of the target data table in the first database to obtain a first signature; the at least one second server signs, according to the target range, the data of the target data table in the second database to obtain a second signature; and the client determines, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.
In the method of the first aspect, the client determines the target range according to the distribution of the data of the data table, the servers sign the data according to the target range, and the client only needs to compare whether the signatures corresponding to the data of the data table in the two databases are consistent to determine whether the data of the two data tables is consistent. This avoids transferring and comparing large amounts of data, runs fast at low cost, and occupies few network resources.
In a possible implementation of the first aspect, each server machine of the first database corresponds to one first server, and the first range includes the sub-range of the data of the target data table on each server machine of the first database; each server machine of the second database corresponds to one second server, and the second range includes the sub-range of the data of the target data table on each server machine of the second database. That the client determines the target range according to at least one of the first range and the second range includes: the client determines the sub-ranges of the target range according to the sub-range of the data of the target data table on each server machine of the first database and the sub-range of the data of the target data table on each server machine of the second database, such that the data corresponding to each sub-range of the target range is located on a single server machine in the first database and on a single server machine in the second database. With this implementation, signing the data no longer requires data transfer across server machines (across RSs), which further increases running speed and reduces network resource usage.
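As an illustration of how such sub-ranges could be derived, the following Python sketch cuts the overall key range at every per-server boundary of either database. It assumes integer keys, inclusive ranges, and that both databases cover the same overall range with sorted, contiguous per-server ranges. The function name `split_target_range` and the example region layout are hypothetical and are not taken from the application.

```python
def split_target_range(first_ranges, second_ranges):
    """Cut the key space at every per-server boundary of either database.

    Each input is a sorted list of inclusive (start, end) integer ranges, one
    per server machine. Every returned sub-range lies inside exactly one range
    of each input, so it never crosses a server machine in either database.
    """
    # Every range end in either database is a cut point.
    ends = sorted({end for _, end in first_ranges} | {end for _, end in second_ranges})
    start = min(first_ranges[0][0], second_ranges[0][0])
    sub_ranges = []
    for end in ends:
        sub_ranges.append((start, end))
        start = end + 1
    return sub_ranges


# Hypothetical per-server ranges of the target data table (assumed layout).
first_db = [(0, 30), (31, 80), (81, 100)]
second_db = [(0, 25), (26, 58), (59, 100)]
target = split_target_range(first_db, second_db)
print(target)  # [(0, 25), (26, 30), (31, 58), (59, 80), (81, 100)]
```

Because each resulting sub-range sits on one server machine per database, each signature can be computed locally by the server that holds the data.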
In a possible implementation of the first aspect, before the at least one first server signs, according to the target range, the data of the target data table in the first database to obtain the first signature, and the at least one second server signs, according to the target range, the data of the target data table in the second database to obtain the second signature, the method further includes: at least one of the client, the at least one first server, and the at least one second server performs tree segmentation for each sub-range. That the at least one first server signs, according to the target range, the data of the target data table in the first database to obtain the first signature includes: the at least one first server signs, according to the tree segmentation, the segments of the data of the target data table in the first database to obtain a tree-type first signature. That the at least one second server signs, according to the target range, the data of the target data table in the second database to obtain the second signature includes: the at least one second server signs, according to the tree segmentation, the segments of the data of the target data table in the second database to obtain a tree-type second signature. Tree segmentation of the sub-ranges of the target range yields finer-grained signatures, which makes comparing signatures more efficient.
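A minimal sketch of tree segmentation and bottom-up signing follows, assuming integer keys and rows held in a dictionary. The parameters mirror the "start key, end key, least size, child size" message of the embodiment (values 31, 58, 7, 2). Computing each parent signature as the MD5 of its children's concatenated signatures is an assumption, since the application only states that the tree is built bottom-up; all function names and the row contents are hypothetical.

```python
import hashlib


def leaf_segments(start_key, end_key, least_size):
    """Cut [start_key, end_key] into consecutive leaf segments of least_size keys."""
    return [(lo, min(lo + least_size - 1, end_key))
            for lo in range(start_key, end_key + 1, least_size)]


def md5_hex(text):
    """Sign a string with MD5 and return the hex digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_signature_tree(rows, segments, child_size):
    """Return the layers of the signature tree, root layer first, leaf layer last."""
    if child_size < 2:
        raise ValueError("child_size must be at least 2")
    # Lowest layer: one signature per leaf segment, over its rows in key order.
    layer = [md5_hex("".join(f"{k}={rows[k]};" for k in range(lo, hi + 1) if k in rows))
             for lo, hi in segments]
    layers = [layer]
    # Build upward: each parent signs the concatenation of child_size children (assumed rule).
    while len(layer) > 1:
        layer = [md5_hex("".join(layer[i:i + child_size]))
                 for i in range(0, len(layer), child_size)]
        layers.append(layer)
    return layers[::-1]


rows = {k: f"value-{k}" for k in range(31, 59)}   # hypothetical table contents
segments = leaf_segments(31, 58, 7)
tree = build_signature_tree(rows, segments, 2)
print(segments)                        # [(31, 37), (38, 44), (45, 51), (52, 58)]
print([len(layer) for layer in tree])  # [1, 2, 4]
```

If both databases build their trees from the same segmentation, signatures at the same position of the same layer cover the same key range and can be compared directly.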
In a possible implementation of the first aspect, that the client determines, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database includes: the client determines, according to the tree-type first signature and the tree-type second signature, whether the signatures at the same layer of the trees of the first signature and the second signature are consistent; when the signatures are inconsistent, the client determines that, for the segment corresponding to that layer, the data of the target data table in the first database differs from the data of the target data table in the second database.
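The layer-by-layer comparison can be sketched as a top-down descent that only follows branches whose signatures differ. The representation of a tree as a list of layers (root layer first), the function name `find_mismatched_leaves`, and the placeholder signature strings are assumptions made for this example.

```python
def find_mismatched_leaves(first_tree, second_tree, child_size):
    """Return the indices of the leaf segments whose signatures differ.

    Each tree is a list of layers, root layer first; both trees are assumed
    to have the same shape because they share one segmentation.
    """
    suspects = [0]  # start from the single root signature
    for depth in range(len(first_tree)):
        a, b = first_tree[depth], second_tree[depth]
        # Keep only the nodes whose signatures disagree at this layer.
        suspects = [i for i in suspects if a[i] != b[i]]
        if not suspects:
            return []
        if depth + 1 < len(first_tree):
            width = len(first_tree[depth + 1])
            # Descend into the children of every mismatching node.
            suspects = [c for i in suspects
                        for c in range(i * child_size, min((i + 1) * child_size, width))]
    return suspects


# Placeholder signatures: only the last leaf differs between the two databases.
first = [["r1"], ["a", "b"], ["w", "x", "y", "z"]]
second = [["r2"], ["a", "b2"], ["w", "x", "y", "z2"]]
print(find_mismatched_leaves(first, second, 2))  # [3]
```

With equal root signatures the function returns immediately, which corresponds to the case where only the highest-layer signatures need to be exchanged.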
In a possible implementation of the first aspect, that at least one of the client, the at least one first server, and the at least one second server performs tree segmentation for each sub-range includes: the at least one first server and the at least one second server collect statistics on the density of the data in the target range; and the at least one first server and the at least one second server perform tree segmentation for each sub-range according to the statistical result. This implementation can balance the load across the server machines.
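The detailed description notes that each sub-range only needs to be counted on a server of one of the two databases, and that this counting work can be spread so that no server is idle or overloaded. One way this could look is a greedy assignment, sketched below. The load measure (number of keys in the sub-range), the tie-breaking rule, the server layout, and all names are assumptions made for illustration; the application does not specify an assignment algorithm.

```python
def assign_counting_tasks(sub_ranges, first_owner, second_owner):
    """Assign each sub-range to the less loaded of its two candidate servers.

    first_owner / second_owner map a sub-range to the server holding it in the
    first / second database. Load is approximated by the number of keys already
    assigned to a server (an assumption); ties go to the first database.
    """
    load = {}
    assignment = {}
    for sub in sub_ranges:
        lo, hi = sub
        candidates = [("first", first_owner[sub]), ("second", second_owner[sub])]
        chosen = min(candidates, key=lambda server: load.get(server, 0))
        load[chosen] = load.get(chosen, 0) + (hi - lo + 1)
        assignment[sub] = chosen
    return assignment


# Hypothetical layout: which RS holds each sub-range in each database.
subs = [(0, 25), (26, 30), (31, 58), (59, 80), (81, 100)]
first_owner = dict(zip(subs, ["RS1", "RS1", "RS2", "RS2", "RS3"]))
second_owner = dict(zip(subs, ["RS1", "RS2", "RS2", "RS3", "RS3"]))
plan = assign_counting_tasks(subs, first_owner, second_owner)
for sub, server in plan.items():
    print(sub, "->", server)
```

In this assumed layout, each of the five sub-ranges ends up on a different server, so the counting requests are spread across both databases.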
In a possible implementation of the first aspect, that the at least one first server signs, according to the target range, the data of the target data table in the first database to obtain the first signature includes: the at least one first server signs, according to the target range, the data of the target data table in the first database by using a hash algorithm to obtain the first signature; and that the at least one second server signs, according to the target range, the data of the target data table in the second database to obtain the second signature includes: the at least one second server signs, according to the target range, the data of the target data table in the second database by using a hash algorithm to obtain the second signature.
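As a small sketch of hash-based signing, the function below feeds the rows of one range into an MD5 digest in key order, so that only the fixed-length digest has to leave the server. The row serialization (`key=value` lines), the dictionary-based table, and the name `sign_range` are assumptions for this example; the application names MD5 as one possible hash algorithm but does not prescribe a row encoding.

```python
import hashlib


def sign_range(rows, start_key, end_key):
    """Return the MD5 hex digest of all rows with start_key <= key <= end_key."""
    digest = hashlib.md5()
    # Rows are hashed in key order so both databases produce comparable digests.
    for key in sorted(k for k in rows if start_key <= k <= end_key):
        digest.update(f"{key}={rows[key]}\n".encode("utf-8"))
    return digest.hexdigest()


# Hypothetical copies of the same table in the two databases.
first_db_rows = {1: "alice", 2: "bob", 3: "carol"}
second_db_rows = {1: "alice", 2: "bob", 3: "carla"}

print(sign_range(first_db_rows, 1, 2) == sign_range(second_db_rows, 1, 2))  # True
print(sign_range(first_db_rows, 1, 3) == sign_range(second_db_rows, 1, 3))  # False
```

MD5 is adequate for detecting accidental divergence but is not collision-resistant against deliberate tampering; `hashlib.sha256` could be substituted with the same calling pattern if that matters.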
A second aspect of the present application provides a system for comparing data of a data table. The system is configured to compare data of a target data table in a first database and a second database, and includes a computing device running a client and a plurality of server machines running servers, where the first database includes at least one first server machine running a first server, and the second database includes at least one second server machine running a second server. The computing device is configured to obtain first metadata of the target data table in the first database and second metadata of the target data table in the second database, where the first metadata includes a first range to which the data of the target data table corresponds on the server machines of the first database, and the second metadata includes a second range to which the data of the target data table corresponds on the server machines of the second database; the computing device is further configured to determine a target range according to at least one of the first range and the second range; the at least one first server machine is configured to sign, according to the target range, the data of the target data table in the first database to obtain a first signature; the at least one second server machine is configured to sign, according to the target range, the data of the target data table in the second database to obtain a second signature; and the computing device is further configured to determine, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.
In a possible implementation of the second aspect, each server machine in the first database that stores the target data table is a first server machine running the first server, and the first range includes the sub-range of the data of the target data table on each first server machine of the first database; each server machine in the second database that stores the target data table is a second server machine running the second server, and the second range includes the sub-range of the data of the target data table on each second server machine of the second database. The computing device is specifically configured to: determine the sub-ranges of the target range according to the sub-range of the data of the target data table on each first server machine of the first database and the sub-range of the data of the target data table on each second server machine of the second database, such that the data corresponding to each sub-range is located on a single server machine in the first database and on a single server machine in the second database.
In a possible implementation of the second aspect, before the first server machine signs, according to the target range, the data of the target data table in the first database to obtain the first signature, and the second server machine signs, according to the target range, the data of the target data table in the second database to obtain the second signature, at least one of the computing device, the at least one first server machine, and the at least one second server machine is configured to perform tree segmentation for each sub-range; the at least one first server machine is specifically configured to sign, according to the tree segmentation, the segments of the data of the target data table in the first database to obtain a tree-type first signature; and the at least one second server machine is specifically configured to sign, according to the tree segmentation, the segments of the data of the target data table in the second database to obtain a tree-type second signature.
In a possible implementation of the second aspect, the computing device is specifically configured to: determine, according to the tree-type first signature and the tree-type second signature, whether the signatures at the same layer of the trees of the first signature and the second signature are consistent; and when the signatures are inconsistent, determine that, for the segment corresponding to that layer, the data of the target data table in the first database differs from the data of the target data table in the second database.
In a possible implementation of the second aspect, the at least one first server machine and the at least one second server machine are configured to collect statistics on the density of the data in the target range, and to perform tree segmentation for each sub-range according to the statistical result.
In a possible implementation of the second aspect, the at least one first server machine is specifically configured to sign, according to the target range, the data of the target data table in the first database by using a hash algorithm to obtain the first signature; and the at least one second server machine is specifically configured to sign, according to the target range, the data of the target data table in the second database by using a hash algorithm to obtain the second signature.
A third aspect of the present application provides a storage medium storing a program. When the program is run by a computing device and server machines, the computing device and the server machines perform the method for comparing data of a data table provided by the first aspect or any implementation of the first aspect. The storage medium includes, but is not limited to, a read-only memory, a random access memory, a flash memory, an HDD, or an SSD.
A fourth aspect of the present application provides a computer program product that includes program instructions. When the computer program product is executed by a computing device and server machines, the computing device and the server machines perform the method for comparing data of a data table provided by the first aspect or any implementation of the first aspect. The computer program product may be a software installation package; when the method provided by the first aspect or any implementation of the first aspect needs to be used, the computer program product can be downloaded and executed on the computing device and the server machines.
Brief description of the drawings
FIG. 1 is a schematic diagram of a method of comparing data of a data table using a comparison tool.
FIG. 2 is a schematic block diagram of a system for comparing data of a data table according to one embodiment of the present invention.
FIG. 3 is a schematic block diagram of a system for comparing data of a data table according to another embodiment of the present invention.
FIG. 4 is a schematic flowchart of a method of comparing data of a data table according to one embodiment of the present invention.
FIG. 5 is a schematic diagram of dividing the target range according to one embodiment of the present invention.
FIG. 6 is a schematic diagram of dividing the target range according to another embodiment of the present invention.
FIG. 7 is a schematic diagram of dividing the target range according to another embodiment of the present invention.
FIG. 8 is a schematic diagram of dividing the target range according to another embodiment of the present invention.
FIG. 9 is a schematic diagram of dividing the target range according to another embodiment of the present invention.
FIG. 10 is a schematic diagram of the result of dividing the target range according to one embodiment of the present invention.
FIG. 11 is a schematic diagram of building a tree-type signature according to one embodiment of the present invention.
FIG. 12 is a schematic block diagram of a computing device or server according to one embodiment of the present invention.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings.
To verify the consistency of data in two databases, existing comparison tools compare the data itself. When such a tool is used to compare the contents of data tables in two databases, it parallelizes the verification task.
Taking an existing comparison tool as an example, the following describes, with reference to the Hadoop database (HBase), the process of comparing data of data tables in databases. FIG. 1 is a schematic diagram of a method 100 in which an existing comparison tool compares data of data tables. The method 100 includes:
S110: The existing comparison tool submits an MR job to the HBase cluster corresponding to the database of data center (DC) 1.
S120: The remote master (RM) of the HBase cluster distributes the MR job to many nodes for parallel execution, that is, assigns the MR job to multiple map tasks.
S130: Each map task is responsible for comparing a portion of the data. Each map task reads data from the HBase clusters of the two data centers, DC1 and DC2, compares the data, and prints any inconsistent data. Typically, every server in an HBase cluster is configured with a Region Server (RS) service program, which manages the tasks running on that server.
The existing comparison tool compares the data in the data tables row by row, so the comparison is inefficient and the tool runs slowly. Second, the existing tool not only requires the participation of both HBase clusters but also requires a cluster to provide the nodes on which the MR job runs, so its resource footprint and running cost are high. In addition, the existing comparison technique requires the map framework to communicate locally with the RSs of multiple servers in the local database's HBase cluster, and possibly also with the RSs of servers in the remote database's HBase cluster, which consumes a large amount of network resources.
To address these problems, an embodiment of the present invention provides a method of comparing data of data tables. FIG. 2 shows a schematic block diagram of a system 200 for comparing data of data tables according to an embodiment of the present invention. It should be understood that FIG. 2 shows the system 200 from a software perspective. As shown in FIG. 2, from a software perspective the system 200 includes a client 210 and multiple server ends, where each database corresponds to at least one server end: the first database corresponds to at least one first server end 221, and the second database corresponds to at least one second server end 222.
FIG. 3 shows a schematic block diagram of a system 300 for comparing data of data tables according to an embodiment of the present invention. It should be understood that FIG. 3 shows the system 300 from a hardware perspective. Corresponding to the software of FIG. 2, the system 300 includes a computing device 310 that runs the client and multiple servers that run the server ends. The client 210 may be deployed on a user's computing device 310, which is usually not a server of either database, that is, usually not a server in a DC. The first server end 221 may be deployed on a first server 321 of the first DC corresponding to the first database; the second server end 222 may be deployed on a second server 322 of the second DC corresponding to the second database. Optionally, one first server end 221 may be deployed on each server of the first database that stores the data table, and a server on which a first server end 221 is deployed is regarded as a first server 321; likewise, one second server end 222 may be deployed on each server of the second database that stores the data table, and a server on which a second server end 222 is deployed is regarded as a second server 322. Alternatively, multiple servers of a database may share one server end; the embodiments of the present invention do not limit this. The numbers of first and second server ends shown in FIG. 2 and of first and second servers shown in FIG. 3 are merely illustrative and do not limit the embodiments of the present invention.
In addition, the embodiments of the present invention involve obtaining metadata (meta data). Metadata is generally stored in a meta table, which is usually stored on a server of the database other than the servers that store the data table. FIG. 3 schematically shows the meta table of the first database stored on a third server 323 of the first database and the meta table of the second database stored on a fourth server 324 of the second database. The meta table may also be stored on a server of the database that stores the data table; the embodiments of the present invention do not limit this.
It should be understood that each computing device and server in the system 300 can be regarded as a node. A server that stores the data table (for example, a first server or a second server) can be regarded as a storage node. A server end is deployed on each storage node; the server end may be implemented as part of the RS's functionality or may exist independently of the RS. A server that stores the meta table can be regarded as a metadata management node.
It should also be understood that the server end in the embodiments of the present invention may be a functional module of the RS or a separate module or unit; the embodiments of the present invention do not limit this.
FIG. 4 shows a schematic flowchart of a method 400 of comparing data of data tables according to an embodiment of the present invention. As shown in FIG. 4, the method 400 includes:
S410: The client 210 obtains first metadata of a target data table in the first database and second metadata of the target data table in the second database. The first metadata includes a first range corresponding to the data of the target data table on the servers of the first database, and the second metadata includes a second range corresponding to the data of the target data table on the servers of the second database.
S420: The client 210 determines a target range according to at least one of the first range and the second range.
S430: At least one first server end 221 signs the data of the target data table in the first database according to the target range to obtain a first signature.
S440: At least one second server end 222 signs the data of the target data table in the second database according to the target range to obtain a second signature.
S450: The client 210 determines, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.
In the method of this embodiment, the client determines the target range according to how the data of the data table is distributed, the server ends sign the data according to the target range, and the client only needs to check whether the signatures of the data in the two databases match to determine whether the data of the two data tables is consistent. This avoids transferring and comparing large amounts of data, so the method runs fast at low cost and consumes few network resources.
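The flow of S430 to S450 can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the in-memory dictionaries standing in for the two databases, the row serialization `key=value;`, and the function names `sign_range` and `tables_match` are all assumptions made for illustration, and MD5 is used only because the description later names it as one possible hash algorithm.

```python
import hashlib


def sign_range(rows, start_key, end_key):
    """Server-end role (assumed): hash every row whose key lies in [start_key, end_key]."""
    digest = hashlib.md5()
    for key in sorted(k for k in rows if start_key <= k <= end_key):
        # Hypothetical row serialization; a real table would hash its own row format.
        digest.update(f"{key}={rows[key]};".encode("utf-8"))
    return digest.hexdigest()


def tables_match(rows_db1, rows_db2, target_range):
    """Client role (assumed): compare the two signatures, never the rows themselves."""
    start_key, end_key = target_range
    first_signature = sign_range(rows_db1, start_key, end_key)
    second_signature = sign_range(rows_db2, start_key, end_key)
    return first_signature == second_signature


# Two copies of the same table, then one row diverges in the second database.
db1 = {key: f"value-{key}" for key in range(1, 101)}
db2 = dict(db1)
same_before = tables_match(db1, db2, (1, 100))
db2[42] = "corrupted"
same_after = tables_match(db1, db2, (1, 100))
print(same_before, same_after)
```

In this sketch only two fixed-size digests would cross the network per target range, which is the property the paragraph above relies on.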
Specifically, the first database and the second database, in which the target data tables to be compared are located, are different databases, and the two databases may further belong to server clusters of different data centers. The two databases may also belong to the same data center; the embodiments of the present invention do not limit this.
Typically, a data table in a database is large and generally needs to be partitioned horizontally and stored across multiple servers to increase the speed of concurrent processing.
In S410, the client 210 communicates with the servers of the first database and of the second database that store the target data table to obtain the first metadata of the target data table in the first database and the second metadata of the target data table in the second database. Metadata (meta data) is generally stored in a meta table, which is usually stored on a server of the database other than the servers that store the data table, although the meta table may also be stored on a server that stores the data table; the embodiments of the present invention do not limit this.
The client 210 obtains the two meta tables corresponding to the target data tables of the two databases, that is, the first metadata and the second metadata. Assume that each database includes three servers, that one RS runs on each server, and that each RS stores one region of the target data table. From the first metadata and the second metadata, the range of each region, that is, its start key and end key, can be obtained. The first metadata includes the first range corresponding to the data of the target data table on the servers of the first database, and the second metadata includes the second range corresponding to the data of the target data table on the servers of the second database. In a specific example, the distribution of the target data table table1 may be as shown in Table 1.
Table 1 Distribution of the target data table
(Table 1 image: PCTCN2017108196-appb-000001)
In the first database, the target data table has keys 1-30 on RS1, keys 31-80 on RS2, and keys 81-100 on RS3. In the second database, the target data table has keys 1-25 on RS1, keys 26-60 on RS2, and keys 61-100 on RS3.
In S420, the client 210 determines the target range according to at least one of the first range and the second range.
Optionally, the distribution in the above example satisfies the following: each server of the first database corresponds to one first server end 221, and the first range includes the sub-range of the target data table's data on each server of the first database; each server of the second database corresponds to one second server end 222, and the second range includes the sub-range of the target data table's data on each server of the second database. In S420, the client 210 determining the target range according to at least one of the first range and the second range may include: the client 210 determines the sub-ranges of the target range according to the sub-ranges of the target data table's data on each server of the first database and on each server of the second database, such that the data corresponding to each sub-range is located on a single server in the first database and on a single server in the second database.
Specifically, based on the first range and the second range of the two data tables (that is, the distributions of start keys and end keys), the client 210 may split the ranges so that overlapping ranges are matched as widely as possible, to obtain the target range. The target range includes multiple sub-ranges, and the data corresponding to each sub-range is located on a single server in the first database and on a single server in the second database. In this way, when the data is subsequently signed, no data needs to be transferred across servers (across RSs), which further increases running speed and reduces network resource usage.
The following describes in detail a scheme for dividing the target range into sub-ranges. This scheme not only ensures that each sub-range of the target range is located on a single server in the first database and on a single server in the second database, but also ensures that the number of sub-ranges is minimal. The specific splitting steps may be as follows.
Step 1. The client 210 arranges the ranges over which the target data tables of the two databases are distributed on the servers into two region queues, in ascending order of row key. The first range corresponds to region queue A (A1, A2, ...), and the second range corresponds to region queue B (B1, B2, ...). The client 210 takes regions from the two region queues in order.
Step 2. The client 210 compares the ranges of the two selected regions (for example, Ax and By) to check whether they overlap. There are several cases:
a) If the two regions do not overlap, the region with the smaller start key is output as a finished region (that is, one sub-range of the target range); the next region is then taken from the region queue that held that region, and step 2 is repeated to continue the comparison.
b) If the two regions overlap, there are again several cases:
I. Complete overlap:
As shown in FIG. 5, when the two regions (A1 and B1) overlap completely, either one is output as a finished region C1 (that is, one sub-range of the target range); the next region is then taken from each of the two region queues, and step 2 is repeated to continue the comparison.
II. Partial overlap (same start key, different end keys):
As shown in FIG. 6, when the two regions (A1 and B1) overlap partially, the overlapping part is cut out and output as a finished region C1 (that is, one sub-range of the target range). B1 is truncated, and its remaining part, region B1-, is compared as a new region with the next region A2 of region queue A in step 2.
III. Partial overlap (different start keys and different end keys, one region containing the other):
As shown in FIG. 7, when region B1 completely contains region A1, region B1 is split at the start key and end key of region A1, yielding C1, C2, and B1- (the remaining part of region B1). C1 and C2 (each a sub-range of the target range) are saved as splitting results, and B1- and the next region A2 of region queue A become the two regions to be compared in step 2.
IV. Partial overlap (different start keys and different end keys, neither region containing the other):
As shown in FIG. 8, the start key of region B1 is smaller than the start key of region A1, and the end key of region B1 is also smaller than the end key of region A1. Regions A1 and B1 are split using the start key of region A1 and the end key of region B1 as cut points. The first two resulting regions, C1 and C2 (each a sub-range of the target range), are output as results, and the remaining part A1- of region A1 and the next region B2 of region queue B become the two regions to be compared in step 2.
V. Partial overlap (different start keys, same end key):
In the example shown in FIG. 9, regions A1 and B1 are split using the start key of region A1 as the cut point. The two resulting regions, C1 and C2 (each a sub-range of the target range), are output as splitting results, and then the next region A2 of region queue A and the next region B2 of region queue B become the two regions to be compared in step 2.
Step 3. The client 210 reads the regions in the first range and in the second range of the target data tables of the two databases in turn until the division is complete.
FIG. 10 shows the result of dividing the regions in the first range and the second range of the target data table in the example of Table 1. The target range includes five sub-ranges, each of which is located on a single RS in both the first database and the second database and does not span RSs.
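Steps 1 to 3 can be sketched in Python as a merge over two sorted region queues. The sketch below is an illustration under stated assumptions, not the embodiment's code: it assumes integer row keys with inclusive start and end keys (so the key after `end` is `end + 1`), and the function name `split_target_range` is hypothetical. The input ranges are the ones given in the paragraph describing Table 1.

```python
def split_target_range(regions_a, regions_b):
    """Split two sorted lists of inclusive (start_key, end_key) regions into
    sub-ranges that each lie inside one region of A and one region of B."""
    queue_a = [list(region) for region in regions_a]  # region queue A
    queue_b = [list(region) for region in regions_b]  # region queue B
    i = j = 0
    sub_ranges = []
    while i < len(queue_a) and j < len(queue_b):
        (a_start, a_end), (b_start, b_end) = queue_a[i], queue_b[j]
        if a_end < b_start:
            # Case a): no overlap, the region with the smaller start key is output.
            sub_ranges.append((a_start, a_end))
            i += 1
        elif b_end < a_start:
            sub_ranges.append((b_start, b_end))
            j += 1
        else:
            # Case b): overlap. Output the part before the overlap first, if any.
            if a_start != b_start:
                sub_ranges.append((min(a_start, b_start), max(a_start, b_start) - 1))
            overlap_end = min(a_end, b_end)
            sub_ranges.append((max(a_start, b_start), overlap_end))
            # Consume a region that ends with the overlap; otherwise keep its remainder.
            if a_end == overlap_end:
                i += 1
            else:
                queue_a[i][0] = overlap_end + 1
            if b_end == overlap_end:
                j += 1
            else:
                queue_b[j][0] = overlap_end + 1
    # Step 3: whatever is left in either queue is already a finished sub-range.
    sub_ranges.extend(tuple(region) for region in queue_a[i:])
    sub_ranges.extend(tuple(region) for region in queue_b[j:])
    return sub_ranges


first_range = [(1, 30), (31, 80), (81, 100)]   # first database: RS1, RS2, RS3
second_range = [(1, 25), (26, 60), (61, 100)]  # second database: RS1, RS2, RS3
target_sub_ranges = split_target_range(first_range, second_range)
print(target_sub_ranges)
```

For these inputs the sketch yields five sub-ranges, and every cut falls on a region boundary of one of the two databases, so no sub-range spans two RSs in either database.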
Optionally, in S420, the client 210 may instead use one of the first range and the second range as the target range; the embodiments of the present invention do not limit the specific manner of dividing the target range.
After the target range is determined, the server ends may sign the data of the target data tables in the two databases directly, using the sub-ranges of the target range as the finest granularity.
Optionally, in an embodiment, before the at least one first server end signs the data of the target data table in the first database according to the target range to obtain the first signature in S430, and before the at least one second server end signs the data of the target data table in the second database according to the target range to obtain the second signature in S440, the method 400 may further include: at least one of the client, the at least one first server end, and the at least one second server end performs tree segmentation on each sub-range. In S430, the at least one first server end signing the data of the target data table in the first database according to the target range to obtain the first signature may include: the at least one first server end signs the segments of the data of the target data table in the first database according to the tree segmentation to obtain a tree-shaped first signature. In S440, the at least one second server end signing the data of the target data table in the second database according to the target range to obtain the second signature may include: the at least one second server end signs the segments of the data of the target data table in the second database according to the tree segmentation to obtain a tree-shaped second signature. Performing tree segmentation on the sub-ranges of the target range in this way yields finer-grained signatures and can make signature comparison more efficient.
The following describes, with reference to a specific embodiment, how the method of the embodiments performs tree segmentation on each sub-range. In this embodiment, at least one of the client, the at least one first server end, and the at least one second server end performing tree segmentation on each sub-range includes: the at least one first server end and the at least one second server end collect statistics on the density of the data in the target range, and perform tree segmentation on each sub-range according to the statistics.
Specifically, the client 210 encapsulates the information about the sub-ranges of the split target range into statistics-counting requests and sends them to the server ends of the two databases. Because the target data tables in the two databases to be compared have the same data structure, each sub-range needs to be counted by the server end of only one of the two databases. In one embodiment of the present invention, the counting work is load balanced across the server ends of the two databases. As shown in Table 2, the sub-range [1-25] is assigned to a second server end of the second database (corresponding to RS1) for density statistics, and the sub-range [26-30] is assigned to a first server end of the first database (corresponding to RS1). The sub-range [81-100] may be assigned to either a first server end of the first database or a second server end of the second database. In this way, no RS is idle and no RS is overloaded, so the load is balanced across the servers.
Of course, in other embodiments of the present invention, load balancing across the servers need not be considered: the client 210 may choose the server end of either database to collect the data density statistics for a sub-range, or the client 210 may select one of the two databases and use that database's server ends for all density statistics. The embodiments of the present invention do not limit this.
Table 2 Density statistics
Sub-range of target range    First database              Second database
1-25                         Wait                        Density statistics (RS1)
26-30                        Density statistics (RS1)    Wait
31-58                        Wait                        Density statistics (RS2)
59-80                        Density statistics (RS2)    Wait
81-100                       Wait                        Density statistics (RS3)
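One simple policy that reproduces the database column of Table 2 is to alternate between the two databases from one sub-range to the next. The description does not state the assignment rule, so the strict alternation below, the choice to start with the second database, and the function name `assign_density_counting` are all assumptions; the Python sketch only illustrates the load-balancing idea.

```python
def assign_density_counting(sub_ranges, first_choice="second"):
    """Alternate the density-statistics work between the two databases.

    Returns (sub_range, counting_database, waiting_database) tuples.
    The alternation rule is an assumed policy, not one given in the description.
    """
    order = ["second", "first"] if first_choice == "second" else ["first", "second"]
    plan = []
    for index, sub_range in enumerate(sub_ranges):
        counting = order[index % 2]
        waiting = order[(index + 1) % 2]
        plan.append((sub_range, counting, waiting))
    return plan


table2_sub_ranges = [(1, 25), (26, 30), (31, 58), (59, 80), (81, 100)]
plan = assign_density_counting(table2_sub_ranges)
for sub_range, counting, waiting in plan:
    print(f"{sub_range[0]}-{sub_range[1]}: {counting} database counts, {waiting} database waits")
```

Because each sub-range lies on a single RS in each database, the RS that does the counting follows directly from the chosen database and the metadata obtained in S410.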
According to Table 2, RS2 of the second database collects statistics on the density of the data in sub-range [31-58] and segments the sub-range accordingly: sub-range [31-58] is divided into a tree with two branches per level, whose lowest-level segments (that is, the finest-grained segments) are [31-37], [38-44], [45-51], and [52-58]. RS2 of the second database encapsulates this information and sends it to RS2 of the first database. The format may be "start key,end key,least size,child size", with the value "31,58,7,2". After receiving this information, RS2 of the first database has the tree segmentation information. The first server end of the first database (corresponding to RS2) reads the data according to the tree segmentation and signs the segments of the data of the target data table in the first database to obtain the tree-shaped first signature.
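The message "31,58,7,2" can be expanded back into the tree of segments with a short Python sketch. The interpretation used here is an assumption inferred from the example: "least size" is read as the number of keys in each lowest-level segment and "child size" as the number of branches per node. The function names `parse_tree_message` and `tree_levels` are hypothetical.

```python
def parse_tree_message(message):
    """Parse "start key,end key,least size,child size" into four integers."""
    start_key, end_key, least_size, child_size = (int(part) for part in message.split(","))
    if least_size < 1 or child_size < 2:
        raise ValueError("least size must be >= 1 and child size must be >= 2")
    return start_key, end_key, least_size, child_size


def tree_levels(message):
    """Return the segment tree as a list of levels, lowest level first."""
    start_key, end_key, least_size, child_size = parse_tree_message(message)
    # Lowest level: consecutive segments of least_size keys (assumed meaning).
    leaves = [
        (segment_start, min(segment_start + least_size - 1, end_key))
        for segment_start in range(start_key, end_key + 1, least_size)
    ]
    levels = [leaves]
    # Each higher level merges child_size neighbouring segments into one.
    while len(levels[-1]) > 1:
        below = levels[-1]
        groups = [below[i:i + child_size] for i in range(0, len(below), child_size)]
        levels.append([(group[0][0], group[-1][1]) for group in groups])
    return levels


levels = tree_levels("31,58,7,2")
for depth, level in enumerate(levels):
    print(depth, level)
```

Sending only these four numbers lets the receiving RS rebuild the same segmentation locally, so the two databases sign identical segments without exchanging the segment list itself.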
It should be understood that reading data is a time-consuming step in the embodiments of the present invention. Therefore, RS2 of the second database may complete its signing while it collects statistics on the density of the data in the sub-range of the target range.
The process by which a server end signs the data segments according to the tree segmentation to obtain a tree-shaped signature may be as follows. The server end computes a signature for each lowest-level segment of each sub-range's tree and then builds the tree bottom-up following its branches. FIG. 11 shows a schematic diagram of building a tree-shaped signature according to an embodiment of the present invention.
Step a. First compute the signatures of the data in the finest-grained segments. For example, v1=[31-37], v2=[38-44], v3=[45-51], v4=[52-58].
Step b. With the tree's branching factor set to 2, compute the signatures of the next level up. For example, v5=[31-44]=signature(v1, v2), v6=[45-58]=signature(v3, v4).
Step c. If the number of signatures at this level is not 1, repeat step b; if it is 1, stop. The final result is the top-level signature v7=[31-58]=signature(v5, v6).
Optionally, the embodiments of the present invention sign the data with a hash algorithm; for example, the data may be signed with Message Digest Algorithm 5 (MD5). Accordingly, in S430, the at least one first server end signing the data of the target data table in the first database according to the target range to obtain the first signature may include: the at least one first server end signs the data of the target data table in the first database using a hash algorithm according to the target range to obtain the first signature. In S440, the at least one second server end signing the data of the target data table in the second database according to the target range to obtain the second signature may include: the at least one second server end signs the data of the target data table in the second database using a hash algorithm according to the target range to obtain the second signature.
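Steps a to c, combined with MD5 as the hash algorithm, can be sketched in Python with the standard `hashlib` module. This is a minimal illustration under assumptions: the rows are held in a dictionary, the row serialization `key=value;` is invented for the example, a parent signature is taken to be the MD5 of its children's hex digests concatenated in order, and the function names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def sign_segment(rows, segment):
    """Step a: sign the rows of one finest-grained segment (inclusive key range)."""
    start_key, end_key = segment
    payload = "".join(
        f"{key}={rows[key]};" for key in sorted(rows) if start_key <= key <= end_key
    )
    return md5_hex(payload.encode("utf-8"))


def build_signature_tree(rows, leaves, child_size=2):
    """Return the signature levels, lowest level first; the last level holds one signature."""
    if child_size < 2:
        raise ValueError("child_size must be >= 2")
    level = [sign_segment(rows, segment) for segment in leaves]  # step a
    levels = [level]
    while len(level) > 1:  # step c: repeat until a single signature remains
        level = [
            md5_hex("".join(level[i:i + child_size]).encode("ascii"))  # step b
            for i in range(0, len(level), child_size)
        ]
        levels.append(level)
    return levels


leaves = [(31, 37), (38, 44), (45, 51), (52, 58)]
rows = {key: f"row-{key}" for key in range(31, 59)}
tree = build_signature_tree(rows, leaves)

changed_rows = dict(rows)
changed_rows[40] = "tampered"  # falls in segment [38-44]
changed_tree = build_signature_tree(changed_rows, leaves)
print(tree[-1][0] == changed_tree[-1][0])
```

A change to a single row alters only the signatures on the path from its segment up to the top of the tree, which is what makes the level-by-level comparison described below possible.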
After obtaining the signature, each server end may feed back the tree-type first signature or the tree-type second signature to the client 210. It should be understood that, in this embodiment of the present invention, each sub-range corresponds to one tree-type signature, so there may be a plurality of first signatures and a plurality of second signatures. Alternatively, each server end may feed back only the top-layer signature of the tree-type first signature or of the tree-type second signature to the client 210, and send the lower-layer signatures to the client 210 for comparison only when the top-layer signatures are inconsistent. This is not limited in this embodiment of the present invention.
The client 210 receives the signatures of the sub-ranges of the target range from the two databases and compares them. If the top-layer signatures are equal, the contents of the target data table in the two databases are considered consistent, and the comparison ends.
If the client 210 finds that the top-layer signatures are not equal, it compares the lower-layer signatures layer by layer until it finds the finest-grained segments whose signatures are inconsistent, thereby determining which data is inconsistent. Alternatively, when the client 210 finds that the top-layer signatures are not equal, it requests the server ends to return the signatures of the next layer down and continues to compare the returned signatures. If any of them are inconsistent, it again requests the server ends to return the signatures of the next layer down, until the finest-grained segments whose signatures are inconsistent are found.
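The layer-by-layer comparison can be sketched as a top-down walk over two signature trees of the same shape. The sketch assumes a hypothetical representation in which each tree is a list of layers ordered from the top layer to the finest-grained layer, with the children of node `i` at indices `i * branch` to `i * branch + branch - 1` of the next layer. The function name `find_mismatched_segments` and the short strings standing in for signatures are illustrative only.

```python
def find_mismatched_segments(tree_a, tree_b, branch=2):
    """Return the indices of the finest-grained segments whose signatures differ.

    tree_x[0] is the top layer (one signature) and tree_x[-1] is the
    finest-grained layer. Both trees are assumed to have the same shape.
    """
    if tree_a[0] == tree_b[0]:
        return []  # Top-layer signatures are equal: the data is considered consistent.
    suspects = [0]  # Indices of mismatched nodes in the current layer.
    for depth in range(1, len(tree_a)):
        layer_a, layer_b = tree_a[depth], tree_b[depth]
        next_suspects = []
        for parent in suspects:
            first = parent * branch
            last = min(first + branch, len(layer_a))
            # Only the children of mismatched parents need to be compared.
            for child in range(first, last):
                if layer_a[child] != layer_b[child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects


segments = ["31-37", "38-44", "45-51", "52-58"]
# Short strings stand in for real signatures; only equality matters here.
tree_a = [["v7"], ["v5", "v6"], ["v1", "v2", "v3", "v4"]]
tree_b = [["v7x"], ["v5", "v6x"], ["v1", "v2", "v3", "v4x"]]
mismatched = find_mismatched_segments(tree_a, tree_b)
print([segments[i] for i in mismatched])  # ['52-58']
```

In the variant where the server ends return one layer at a time, the lookups `tree_a[depth]` and `tree_b[depth]` would presumably become requests to the server ends for just the children of the mismatched nodes, so that matching subtrees are never transferred.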
In summary, S350, in which the client determines, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database, may include: the client determines, according to the tree-type first signature and the tree-type second signature, whether the signatures at the same layer of the trees of the first signature and the second signature are consistent, and when the signatures are inconsistent, determines that, for the segment corresponding to that layer, the data of the target data table in the first database differs from the data of the target data table in the second database.
Based on the finest-grained segments whose signatures are inconsistent, the client 210 may run a small-range query within each such segment against the target data table of both databases and compare the returned data as strings inside the client 210 to obtain the detailed differences between the data tables. Alternatively, this embodiment of the present invention may skip the detailed comparison and only report whether the data of the target data table is consistent. This is not limited in this embodiment of the present invention.
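The detailed comparison within one mismatched segment might look like the following sketch. The two dicts stand in for the results of the small-range queries against the two databases. The function name `diff_segment`, the dict representation, and the use of `None` for a missing row are assumptions made for illustration.

```python
def diff_segment(rows_a, rows_b, low, high):
    """Compare the rows with low <= key <= high from two query results.

    Returns a list of (key, value_in_first, value_in_second) tuples for
    every key whose values differ; None marks a row missing on one side.
    """
    in_a = {k: v for k, v in rows_a.items() if low <= k <= high}
    in_b = {k: v for k, v in rows_b.items() if low <= k <= high}
    differences = []
    for key in sorted(in_a.keys() | in_b.keys()):
        value_a, value_b = in_a.get(key), in_b.get(key)
        if value_a != value_b:  # Plain string comparison of the row contents.
            differences.append((key, value_a, value_b))
    return differences


first_rows = {52: "x", 53: "y", 60: "z"}
second_rows = {52: "x", 53: "Y", 54: "new", 60: "changed"}
print(diff_segment(first_rows, second_rows, 52, 58))
# [(53, 'y', 'Y'), (54, None, 'new')]
```

Key 60 differs between the two inputs but lies outside the segment [52-58], so it is not reported here; it would be caught when its own segment is compared.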
FIG. 12 is a schematic block diagram of a device 500 according to an embodiment of the present invention. The device 500 may correspond to any computing device or server shown in FIG. 3 of the embodiments of the present invention. As shown in FIG. 12, the device 500 may include a processor 510, a memory 520, and a network interface 530. The processor 510 may be configured to perform the method in the embodiments of the present invention, the memory 520 may be configured to store the code executed by the processor 510, and the network interface 530 is configured to communicate with other devices. The computing device 310 in FIG. 3 may further include an output device, or an output interface connected to an output device, for outputting the comparison result. The output device may include a display, a printer, and the like. The processor, the memory, and the network interface in the device 500 may communicate with one another through internal connection paths to transfer control and/or data signals.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of the present invention.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments. Details are not described herein again.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the system embodiments described above are merely illustrative. The division into units is merely a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings, direct couplings, or communication connections may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

  1. A method for comparing data of a data table, wherein the method is applied to a system for comparing data of a target data table in a first database and a second database, the system comprises a client and a plurality of server ends, the first database corresponds to at least one first server end, the second database corresponds to at least one second server end, and the method comprises:
    obtaining, by the client, first metadata of the target data table in the first database and second metadata of the target data table in the second database, wherein the first metadata comprises a first range to which the data of the target data table corresponds on the servers of the first database, and the second metadata comprises a second range to which the data of the target data table corresponds on the servers of the second database;
    determining, by the client, a target range according to at least one of the first range and the second range;
    signing, by the at least one first server end according to the target range, the data of the target data table in the first database to obtain a first signature;
    signing, by the at least one second server end according to the target range, the data of the target data table in the second database to obtain a second signature; and
    determining, by the client according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.
  2. The method according to claim 1, wherein each server of the first database corresponds to one first server end, the first range comprises a sub-range of the data of the target data table on each server of the first database, each server of the second database corresponds to one second server end, the second range comprises a sub-range of the data of the target data table on each server of the second database, and the determining, by the client, a target range according to at least one of the first range and the second range comprises:
    determining, by the client, sub-ranges of the target range according to the sub-range of the data of the target data table on each server of the first database and the sub-range of the data of the target data table on each server of the second database, wherein the data corresponding to each of the sub-ranges is located on one server in the first database and on one server in the second database.
  3. The method according to claim 2, wherein before the at least one first server end signs, according to the target range, the data of the target data table in the first database to obtain the first signature and the at least one second server end signs, according to the target range, the data of the target data table in the second database to obtain the second signature, the method further comprises:
    performing, by at least one of the client, the at least one first server end, and the at least one second server end, tree segmentation on each of the sub-ranges;
    wherein the signing, by the at least one first server end according to the target range, the data of the target data table in the first database to obtain a first signature comprises: signing, by the at least one first server end according to the tree segmentation, the segments of the data of the target data table in the first database to obtain the tree-type first signature; and
    the signing, by the at least one second server end according to the target range, the data of the target data table in the second database to obtain a second signature comprises: signing, by the at least one second server end according to the tree segmentation, the segments of the data of the target data table in the second database to obtain the tree-type second signature.
  4. The method according to claim 3, wherein the determining, by the client according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database comprises:
    determining, by the client according to the tree-type first signature and the tree-type second signature, whether the signatures at the same layer of the trees of the first signature and the second signature are consistent, and when the signatures are inconsistent, determining that, for the segment corresponding to the layer, the data of the target data table in the first database differs from the data of the target data table in the second database.
  5. The method according to claim 3 or 4, wherein the performing, by at least one of the client, the at least one first server end, and the at least one second server end, tree segmentation on each of the sub-ranges comprises:
    collecting, by the at least one first server end and the at least one second server end, statistics on the density of the data in the target range; and
    performing, by the at least one first server end and the at least one second server end according to the result of the statistics, tree segmentation on each of the sub-ranges.
  6. The method according to any one of claims 1 to 5, wherein the signing, by the at least one first server end according to the target range, the data of the target data table in the first database to obtain a first signature comprises: signing, by the at least one first server end according to the target range, the data of the target data table in the first database by using a hash algorithm to obtain the first signature; and
    the signing, by the at least one second server end according to the target range, the data of the target data table in the second database to obtain a second signature comprises: signing, by the at least one second server end according to the target range, the data of the target data table in the second database by using a hash algorithm to obtain the second signature.
  7. A system for comparing data of a data table, wherein the system is configured to compare data of a target data table in a first database and a second database, the system comprises a computing device that runs a client and a plurality of servers that run server ends, the first database comprises at least one first server that runs a first server end, and the second database comprises at least one second server that runs a second server end, wherein:
    the computing device is configured to obtain first metadata of the target data table in the first database and second metadata of the target data table in the second database, wherein the first metadata comprises a first range to which the data of the target data table corresponds on the servers of the first database, and the second metadata comprises a second range to which the data of the target data table corresponds on the servers of the second database;
    the computing device is further configured to determine a target range according to at least one of the first range and the second range;
    the at least one first server is configured to sign, according to the target range, the data of the target data table in the first database to obtain a first signature;
    the at least one second server is configured to sign, according to the target range, the data of the target data table in the second database to obtain a second signature; and
    the computing device is further configured to determine, according to the first signature and the second signature, whether the data of the target data table in the first database is the same as the data of the target data table in the second database.
  8. The system according to claim 7, wherein each server in the first database that stores the target data table is a first server that runs the first server end, the first range comprises a sub-range of the data of the target data table on each first server of the first database, each server in the second database that stores the target data table is a second server that runs the second server end, the second range comprises a sub-range of the data of the target data table on each second server of the second database, and the computing device is specifically configured to:
    determine sub-ranges of the target range according to the sub-range of the data of the target data table on each first server of the first database and the sub-range of the data of the target data table on each second server of the second database, wherein the data corresponding to each of the sub-ranges is located on one server in the first database and on one server in the second database.
  9. The system according to claim 8, wherein before the first server signs, according to the target range, the data of the target data table in the first database to obtain the first signature and the second server signs, according to the target range, the data of the target data table in the second database to obtain the second signature:
    at least one of the computing device, the at least one first server, and the at least one second server is configured to perform tree segmentation on each of the sub-ranges;
    the at least one first server is specifically configured to sign, according to the tree segmentation, the segments of the data of the target data table in the first database to obtain the tree-type first signature; and
    the at least one second server is specifically configured to sign, according to the tree segmentation, the segments of the data of the target data table in the second database to obtain the tree-type second signature.
  10. The system according to claim 9, wherein the computing device is specifically configured to:
    determine, according to the tree-type first signature and the tree-type second signature, whether the signatures at the same layer of the trees of the first signature and the second signature are consistent, and when the signatures are inconsistent, determine that, for the segment corresponding to the layer, the data of the target data table in the first database differs from the data of the target data table in the second database.
  11. The system according to claim 9 or 10, wherein:
    the at least one first server and the at least one second server are configured to collect statistics on the density of the data in the target range; and
    the at least one first server and the at least one second server are configured to perform, according to the result of the statistics, tree segmentation on each of the sub-ranges.
  12. The system according to any one of claims 7 to 11, wherein the at least one first server is specifically configured to sign, according to the target range, the data of the target data table in the first database by using a hash algorithm to obtain the first signature; and
    the at least one second server is specifically configured to sign, according to the target range, the data of the target data table in the second database by using a hash algorithm to obtain the second signature.
PCT/CN2017/108196 2016-12-30 2017-10-28 Method and system for comparing data of data table WO2018121025A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611260662.8A CN107070645B (en) 2016-12-30 2016-12-30 Method and system for comparing data of data table
CN201611260662.8 2016-12-30

Publications (1)

Publication Number Publication Date
WO2018121025A1 (en) 2018-07-05

Family

ID=59624007

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/108196 WO2018121025A1 (en) 2016-12-30 2017-10-28 Method and system for comparing data of data table

Country Status (2)

Country Link
CN (1) CN107070645B (en)
WO (1) WO2018121025A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960613A (en) * 2019-03-11 2019-07-02 中国银联股份有限公司 A kind of method and device of data batch processing
CN110287182A (en) * 2019-05-05 2019-09-27 浙江吉利控股集团有限公司 A kind of data comparison method, apparatus, equipment and the terminal of big data
CN112395276A (en) * 2020-11-13 2021-02-23 中国人寿保险股份有限公司 Data comparison method and related equipment
CN112613808A (en) * 2020-12-15 2021-04-06 嘉兴蓝匠仓储系统软件有限公司 Method for reading warehouse-in materials by using RFID (radio frequency identification) group

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107070645B (en) * 2016-12-30 2020-06-16 华为技术有限公司 Method and system for comparing data of data table
CN109739831A (en) * 2018-11-23 2019-05-10 网联清算有限公司 Data verification method and device between database

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102084416A (en) * 2008-02-21 2011-06-01 史诺有限公司 Audio visual signature, method of deriving a signature, and method of comparing audio-visual data
US8744840B1 (en) * 2013-10-11 2014-06-03 Realfusion LLC Method and system for n-dimentional, language agnostic, entity, meaning, place, time, and words mapping
CN104391894A (en) * 2014-11-11 2015-03-04 广州科腾信息技术有限公司 Method for checking and processing repeated data
CN107070645A (en) * 2016-12-30 2017-08-18 华为技术有限公司 Compare the method and system of the data of tables of data

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9002792B2 (en) * 2012-11-19 2015-04-07 Compellent Technologies Confirming data consistency in a data storage environment
CN104111937A (en) * 2013-04-18 2014-10-22 中兴通讯股份有限公司 Master database standby database and data consistency testing and repairing method and device of master database and standby database
CN103646073A (en) * 2013-12-11 2014-03-19 浪潮电子信息产业股份有限公司 Condition query optimizing method based on HBase table
CN104077373B (en) * 2014-06-24 2018-12-04 北京京东尚科信息技术有限公司 A kind of data consistency verification method
CN105677645B (en) * 2014-11-17 2018-12-21 阿里巴巴集团控股有限公司 A kind of tables of data comparison method and device
CN105988889B (en) * 2015-02-11 2019-06-14 阿里巴巴集团控股有限公司 A kind of data verification method and device
CN105989089A (en) * 2015-02-12 2016-10-05 阿里巴巴集团控股有限公司 Data comparison method and device
US9910906B2 (en) * 2015-06-25 2018-03-06 International Business Machines Corporation Data synchronization using redundancy detection

Also Published As

Publication number Publication date
CN107070645A (en) 2017-08-18
CN107070645B (en) 2020-06-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17889190

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17889190

Country of ref document: EP

Kind code of ref document: A1