WO2022063223A1

WO2022063223A1 - Data verification method, apparatus, and system

Info

Publication number: WO2022063223A1
Application number: PCT/CN2021/120282
Authority: WO
Inventors: 黄凯耀; 郑云洲; 孟小珍; 李龙; 赵俊; 李志学
Original assignee: 华为技术有限公司
Priority date: 2020-09-28
Filing date: 2021-09-24
Publication date: 2022-03-31
Also published as: CN114281793A

Abstract

A data verification method (100, 200), apparatus (130, 1000, 1210), and system (100, 1200). The method comprises: generating a first Merkle tree according to a first data table in a first database, and generating a second Merkle tree according to a second data table in a second database, wherein the structure and generation method of the first Merkle tree and the second Merkle tree are exactly the same. Whether the first data table and the second data table are consistent can therefore be determined by directly comparing the hash values of the root nodes of the two Merkle trees, so that the demand for accurate and fast verification of massive data can be met.

Description

Data verification method, device and system

This application claims priority to Chinese patent application No. 202011040390.7, entitled "Data Verification Method, Apparatus and System" and filed with the China Patent Office on September 28, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of storage, and more particularly, to a data verification method, apparatus and system.

Background

In the process of data synchronization or data migration of a database (for example, a heterogeneous database or a homogeneous database), the data consistency of the synchronized tables in the source database and the target database needs to be verified in order to confirm the correctness of the data synchronization or data migration. In practical applications, the data in the source database is often inconsistent with the data in the target database. On the one hand, during data transmission and data storage, data loss and data errors can be caused by hardware failures, software defects, human errors, environmental interference, and other factors, so that the data in the source database and the data in the target database become inconsistent. On the other hand, due to performance limitations of the database system, there may be a delay before changes to a source data table are synchronized to the target database, so that the data table in the source database and the data table in the target database are inconsistent when they are verified at a given moment.

Traditional data verification methods cannot meet the needs of accurate and fast verification of massive data.

Summary of the invention

The present application provides a data verification method, device and system, which can be applied in the scenarios of data synchronization or data migration, and the method can better meet the requirements for accurate and fast verification of massive data.

A first aspect provides a data verification method, characterized in that the method includes:

Processing a first data table in a first database to generate a first Merkle tree, where each row of the first data table includes a row identifier and row data, the first Merkle tree includes N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifiers, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;

Processing a second data table in a second database to generate a second Merkle tree, where the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to row identifiers, the hash rule used to hash partition the second data table into the N second hash buckets is the same as the hash rule used to hash partition the first data table into the N first hash buckets, and any two second hash buckets are different;

Comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.

In the above technical solution, the hash value of each first leaf node of the first Merkle tree corresponding to the first data table is determined according to the corresponding first hash bucket, and the hash value of each second leaf node of the second Merkle tree corresponding to the second data table is determined according to the corresponding second hash bucket. Because the hash rule used to hash partition the second data table into the N second hash buckets according to row identifiers is the same as the hash rule used to hash partition the first data table into the N first hash buckets according to row identifiers, the data corresponding to the same row identifier in the second data table and in the first data table is mapped to the second hash bucket and the first hash bucket with the same sequence number, so that the generated first Merkle tree has the same structure as the second Merkle tree. Whether the first data table and the second data table are consistent can therefore be determined directly by comparing the hash values of the root nodes of the two Merkle trees. The data verification method provided by the present application can thus meet the requirements of accurate and fast verification of massive data.

With reference to the first aspect, in some implementations of the first aspect, the first data table includes M rows, where M is a positive integer greater than or equal to 1, and the first data table in the first database is processed to generate the first Merkle tree, including:

Performing hash processing on the first data table to obtain M first hash groups, where the M first hash groups are in one-to-one correspondence with the M rows, each first hash group includes one row identifier in the first data table and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the first hash groups are all different;

Mapping the M first hash groups to the N first hash buckets;

Determining the hash values of the N first leaf nodes according to the N first hash buckets;

Generating the first Merkle tree according to the hash values of the N first leaf nodes.

It should be understood that the second data table in the second database is processed to generate the second Merkle tree in the same way as described above. Specifically, when the second data table includes K rows, where K is a positive integer less than or equal to M, processing the second data table in the second database to generate the second Merkle tree includes:

Performing hash processing on the second data table to obtain K second hash groups, where the K second hash groups are in one-to-one correspondence with the K rows, each second hash group includes one row identifier in the second data table and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the second hash groups are all different;

Mapping the K second hash groups to the N second hash buckets;

Determining the hash values of the N second leaf nodes according to the N second hash buckets;

Generating the second Merkle tree according to the hash values of the N second leaf nodes.

When K is equal to M, the number of rows included in the second data table is the same as the number of rows included in the first data table, that is, no data was lost in the process of synchronizing or migrating the first data table to the second database. When K is less than M, the number of rows included in the second data table is less than the number of rows included in the first data table, that is, data was lost in the process of synchronizing or migrating the first data table to the second database.

It should also be understood that the mapping rule for mapping the K second hash groups to the N second hash buckets is the same as the mapping rule for mapping the M first hash groups to the N first hash buckets.

In the above technical solution, a data partition algorithm is used to map the row data in a data table to hash buckets, and the hash buckets correspond one-to-one to the leaf nodes of the Merkle tree. Since the mapping rule for mapping the K second hash groups to the N second hash buckets is the same as the mapping rule for mapping the M first hash groups to the N first hash buckets, the generated first Merkle tree and second Merkle tree have the same structure.
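As a non-limiting illustration, the construction of hash groups and their mapping to hash buckets may be sketched in Python as follows. The function names `row_hash`, `bucket_index`, and `partition`, the use of SHA-256, the modulo-based partition rule, and the representation of a table as a dictionary keyed by row identifier are all assumptions made for this sketch; the present application does not prescribe them.

```python
import hashlib

def row_hash(row_data: tuple) -> bytes:
    """Hash the row data (assumed here to be a tuple of column values)."""
    return hashlib.sha256(repr(row_data).encode("utf-8")).digest()

def bucket_index(row_id: str, n_buckets: int) -> int:
    """Map a row identifier to a bucket number using one fixed rule."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def partition(table: dict, n_buckets: int) -> list:
    """Build one hash group (row id, row hash) per row and place it in its bucket."""
    buckets = [dict() for _ in range(n_buckets)]
    for row_id, row_data in table.items():
        buckets[bucket_index(row_id, n_buckets)][row_id] = row_hash(row_data)
    return buckets

# M = 3 rows on the source side, N = 4 buckets.
source = {"id-1": ("alice", 30), "id-2": ("bob", 25), "id-3": ("carol", 41)}
# The same rows on the target side, stored in a different order.
target = {"id-3": ("carol", 41), "id-1": ("alice", 30), "id-2": ("bob", 25)}

src_buckets = partition(source, 4)
dst_buckets = partition(target, 4)
```

Because both sides apply the same `bucket_index` rule, a given row identifier lands in the bucket with the same sequence number in both lists, regardless of the order in which the rows are read. In a heterogeneous setting, the row data would additionally need to be normalized to a common representation before hashing; that step is outside this sketch.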

With reference to the first aspect, in some implementations of the first aspect, the hash value of each first leaf node is a hash value obtained by performing an XOR operation on the hash values included in the first hash groups contained in the corresponding first hash bucket.

It should be understood that the hash value of each second leaf node is a hash value obtained by performing an XOR operation on the hash values included in the second hash groups contained in the corresponding second hash bucket.

In the above technical solution, the hash value of a leaf node is obtained by performing an XOR operation on the data in the hash bucket. As a result, when the consistency check is performed on the first data table and the second data table, sorting of the row data can be avoided, so that the efficiency of data verification can be further improved.
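The reason sorting can be avoided is that XOR is commutative and associative, so the result does not depend on the order in which the row hashes are combined. A minimal Python sketch follows; the helper names `row_hash`, `xor_bytes`, and `leaf_hash`, the use of SHA-256, and the choice of an all-zero value for an empty bucket are assumptions made for illustration only.

```python
import hashlib
from functools import reduce

def row_hash(row_data: str) -> bytes:
    """Hash one row's data to a 32-byte digest."""
    return hashlib.sha256(row_data.encode("utf-8")).digest()

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length digests."""
    return bytes(x ^ y for x, y in zip(a, b))

def leaf_hash(bucket_hashes: list) -> bytes:
    """XOR all row hashes in a bucket; an empty bucket yields 32 zero bytes."""
    return reduce(xor_bytes, bucket_hashes, bytes(32))

hashes = [row_hash(r) for r in ("row-a", "row-b", "row-c")]

# The same hashes in reverse order give the same leaf hash.
same_leaf = leaf_hash(hashes) == leaf_hash(list(reversed(hashes)))

# Changing one row changes the leaf hash.
altered = [row_hash("row-a"), row_hash("row-b"), row_hash("row-X")]
changed_leaf = leaf_hash(hashes) != leaf_hash(altered)
```

Here `same_leaf` and `changed_leaf` are both true: reordering the rows leaves the leaf hash unchanged, while modifying a row changes it.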

With reference to the first aspect, in some implementations of the first aspect, comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table includes:

Determining the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;

If the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determining that the first data table is consistent with the second data table;

If the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, determining that the first data table is inconsistent with the second data table.

In the above technical solution, since the structures of the first Merkle tree and the second Merkle tree (for example, the tree height and the row identifiers in the data table corresponding to each leaf node) are exactly the same, whether the first data table and the second data table are consistent can be determined accurately and quickly during the consistency check by judging whether the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree. Specifically, when the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, it may be determined that the first data table and the second data table are consistent. When the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, it may be determined that the first data table is inconsistent with the second data table.
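The root comparison may be illustrated with the following Python sketch. It assumes that the number of leaf nodes N is a power of two, that each parent hash is the SHA-256 digest of the concatenation of its two child hashes, and that the leaf hashes are already available; the names `merkle_root` and `leaf` are hypothetical and are not taken from the present application.

```python
import hashlib

def merkle_root(leaf_hashes: list) -> bytes:
    """Fold a power-of-two list of leaf hashes up to a single root hash."""
    level = list(leaf_hashes)
    while len(level) > 1:
        # Each parent is the hash of its left child followed by its right child.
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def leaf(text: str) -> bytes:
    """Stand-in for a leaf hash derived from one hash bucket."""
    return hashlib.sha256(text.encode("utf-8")).digest()

first_leaves = [leaf(s) for s in ("b0", "b1", "b2", "b3")]
second_leaves = list(first_leaves)

# Identical leaves give identical roots: the tables are judged consistent.
consistent = merkle_root(first_leaves) == merkle_root(second_leaves)

# A single differing leaf changes the root: the tables are judged inconsistent.
second_leaves[2] = leaf("b2-corrupted")
inconsistent = merkle_root(first_leaves) != merkle_root(second_leaves)
```

Only the two root hashes need to be exchanged and compared in the consistent case, which is what makes the check fast for large tables.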

In conjunction with the first aspect, in some implementations of the first aspect, after determining that the first data table is inconsistent with the second data table, the method further includes:

Determining that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, where the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1≤i≤N;

Comparing the i-th first hash bucket with the i-th second hash bucket to determine the row identifiers that are inconsistent between the first data table and the second data table;

Querying, according to the inconsistent row identifiers, the row data corresponding to the inconsistent row identifiers from the first database and the second database respectively.

In the above technical solution, after it is determined that the first data table and the second data table are inconsistent, the row identifiers that are inconsistent between the first database and the second database can also be specifically determined according to the hash values of the leaf nodes and the hash buckets corresponding to the leaf nodes. Further, the row data corresponding to the inconsistent row identifiers can be queried from the first database and the second database according to the determined inconsistent row identifiers. Since the size of the data set of each leaf node is controllable, the time required for determining the inconsistent row identifiers is also controllable.
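A simplified Python sketch of this localization is given below. It compares the two leaf lists position by position rather than descending the tree, represents each hash bucket as a dictionary from row identifier to row hash, and uses 0-based bucket indices where the text above uses 1≤i≤N; the names `leaf_hash`, `differing_leaves`, and `inconsistent_row_ids`, and the placeholder row hashes such as "h1", are assumptions made for illustration.

```python
import hashlib

def leaf_hash(bucket: dict) -> int:
    """XOR of per-row digests, taken as integers (0 for an empty bucket)."""
    acc = 0
    for row_id, row_digest in bucket.items():
        item = f"{row_id}:{row_digest}".encode("utf-8")
        acc ^= int.from_bytes(hashlib.sha256(item).digest(), "big")
    return acc

def differing_leaves(first_leaves: list, second_leaves: list) -> list:
    """Indices of leaf positions whose hashes differ between the two trees."""
    return [i for i, (a, b) in enumerate(zip(first_leaves, second_leaves)) if a != b]

def inconsistent_row_ids(first_bucket: dict, second_bucket: dict) -> set:
    """Row ids missing on either side, or present on both with different hashes."""
    row_ids = set(first_bucket) | set(second_bucket)
    return {r for r in row_ids if first_bucket.get(r) != second_bucket.get(r)}

first_buckets = [{"id-1": "h1", "id-4": "h4"}, {"id-2": "h2"}, {"id-3": "h3"}, {}]
second_buckets = [{"id-1": "h1", "id-4": "h4"}, {"id-2": "h2-changed"}, {}, {}]

first_leaves = [leaf_hash(b) for b in first_buckets]
second_leaves = [leaf_hash(b) for b in second_buckets]

# Only buckets whose leaf hashes differ are compared row by row.
mismatches = {
    i: inconsistent_row_ids(first_buckets[i], second_buckets[i])
    for i in differing_leaves(first_leaves, second_leaves)
}
```

In this example `mismatches` is `{1: {"id-2"}, 2: {"id-3"}}`: row "id-2" was modified and row "id-3" is missing on the second side. The resulting row identifiers could then be used to query the corresponding row data from both databases; that query step is not shown.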

With reference to the first aspect, in some implementations of the first aspect, the first data table includes the full amount of data in at least one data table in the first database.

In conjunction with the first aspect, in some implementations of the first aspect, the first data table includes incremental data in at least one data table in the first database.

In the above technical solution, the full data and the incremental data can be separated, and data verification can be performed as two stages, which can save computational overhead. Specifically, since the full amount of data includes a large amount of data, a Merkle tree with a higher number of layers can be constructed when verifying the full amount of data. Since the amount of data included in the incremental data is small, a Merkle tree with a lower number of layers can be constructed when verifying the incremental data.

In conjunction with the first aspect, in some implementations of the first aspect, the height of the first Merkle tree is associated with the first data table.

In the above technical solution, the height of the first Merkle tree can be adaptively adjusted according to the size of the first data table to be verified.
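One possible way to relate the tree height to the table size is sketched below in Python. The sizing policy (a target average number of rows per bucket, with the leaf count rounded up to a power of two) and the names `choose_leaf_count`, `tree_height`, and `rows_per_bucket` are assumptions introduced for this example; the present application only states that the height is associated with the first data table.

```python
import math

def choose_leaf_count(row_count: int, rows_per_bucket: int = 1000) -> int:
    """Smallest power of two (at least 2) that keeps the average bucket
    at or below rows_per_bucket rows."""
    needed = max(2, math.ceil(row_count / rows_per_bucket))
    return 1 << (needed - 1).bit_length()

def tree_height(leaf_count: int) -> int:
    """Number of levels (root included) of a complete binary tree
    whose leaf count is a power of two."""
    return leaf_count.bit_length()

# A small incremental batch gets a shallow tree.
small_n = choose_leaf_count(500)            # 2 leaves
small_h = tree_height(small_n)              # 2 levels

# A large full-data table gets a deeper tree.
large_n = choose_leaf_count(1_000_000)      # 1024 leaves
large_h = tree_height(large_n)              # 11 levels
```

Under this policy, full data yields a tree with more levels and incremental data yields a tree with fewer levels, while the number of rows behind each leaf node stays bounded on average.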

With reference to the first aspect, in some implementations of the first aspect, the first database and the second database are heterogeneous databases or homogeneous databases.

In the above technical solution, the method for data verification provided by the present application can be applied to the data consistency verification of homogeneous databases and the data consistency verification of heterogeneous databases.

With reference to the first aspect, in some implementations of the first aspect, the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.

In a second aspect, a data verification apparatus is provided, and the data verification apparatus executes the method in the first aspect and any possible implementation manner of the first aspect.

It should be understood that the data verification device provided in the present application is decoupled from and independent of the database system, so the data verification device does not have an intrusive effect on the database system; for example, it does not affect the function or performance of the database system and does not occupy database system resources.

In a third aspect, a data verification device is provided. The device includes a memory and a processor, the memory is used for storing instructions, and the processor is configured to read the instructions stored in the memory, so that the data verification device executes the method in the first aspect and any possible implementation of the first aspect.

In a fourth aspect, a processor is provided, including: an input circuit, an output circuit, and a processing circuit. The processing circuit is configured to receive a signal through the input circuit and transmit a signal through the output circuit, so that the method in the first aspect and any possible implementation of the first aspect is carried out.

In a specific implementation process, the above-mentioned processor may be a chip, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a flip-flop, or any of various logic circuits. The input signal received by the input circuit may be received and input by, for example but not limited to, a receiver; the signal output by the output circuit may be, for example but not limited to, output to and transmitted by a transmitter; and the input circuit and the output circuit may be the same circuit, which acts as the input circuit and the output circuit at different times. The embodiments of the present application do not limit the specific implementation manners of the processor and the various circuits.

In a fifth aspect, a processing apparatus is provided, including a processor and a memory. The processor is configured to read the instructions stored in the memory, and can receive signals through the receiver and transmit signals through the transmitter, so as to execute the first aspect and the method in any possible implementation manner of the first aspect.

Optionally, there are one or more processors and one or more memories.

Optionally, the memory may be integrated with the processor, or the memory may be provided separately from the processor.

In a specific implementation process, the memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated with the processor on the same chip or may be provided separately on a different chip. The embodiments of the present application do not limit the type of the memory or the manner in which the memory and the processor are arranged.

It should be understood that the relevant data interaction process, such as sending indication information, may be a process of outputting indication information from the processor, and receiving capability information may be a process of receiving input capability information by the processor. Specifically, the data output by the processor may be output to the transmitter, and the input data received by the processor may come from the receiver. The transmitter and the receiver may be collectively referred to as a transceiver.

In a sixth aspect, a computer-readable storage medium is provided for storing a computer program, the computer program comprising instructions for executing the method in the above-mentioned first aspect and any possible implementation manner of the above-mentioned first aspect.

In a seventh aspect, there is provided a computer program product comprising instructions that, when run on a computer, cause the computer to execute the method in the above-mentioned first aspect and any possible implementation manner of the above-mentioned first aspect.

In an eighth aspect, a system is provided, the system including the data verification apparatus described in the second aspect.

In a ninth aspect, a chip is provided, including at least one processor and an interface; the at least one processor is used to call and run a computer program, so that the chip executes the method in the above-mentioned first aspect and any possible implementation of the above-mentioned first aspect.

Description of drawings

FIG. 1 is a schematic diagram of a system 100 suitable for the data verification method provided by the present application.

FIG. 2 is a schematic diagram of the data verification apparatus 130 provided by the present application.

FIG. 3 is a schematic flowchart of the data verification method 100 provided by the present application.

FIG. 4 is a schematic diagram of a Merkle tree determined according to the method provided in this application.

FIG. 5 is a schematic diagram of extracting data from the first data table provided by the present application.

FIG. 6 is a schematic diagram of hash partitioning the data extracted from the first data table provided by the present application.

FIG. 7 is a schematic diagram of a Merkle tree determined according to the method provided in the present application.

FIG. 8 is a schematic flowchart of a data verification method 200 provided by the present application.

FIG. 9 is a schematic diagram of a Merkle tree determined according to the method provided in the present application.

FIG. 10 is a schematic structural diagram of a data verification apparatus 1000 provided by the present application.

FIG. 11 is a schematic structural diagram of a data verification device 1000 provided by the present application.

FIG. 12 is a schematic structural diagram of a system 1200 provided by the present application.

Detailed description

The technical solutions in the present application will be described below with reference to the accompanying drawings.

The terms used in the embodiments of the present application are only used to explain specific embodiments of the present application, and are not intended to limit the present application.

In this application, terms such as "first", "second", and "third" are used to distinguish between identical or similar items that have substantially the same function and effect. It should be understood that there is no logical or temporal dependency among "first", "second", and "third", and that these terms do not limit quantity or execution order.

This application will present various aspects, embodiments, or features around a system that may include a plurality of devices, components, modules, and the like. It is to be understood and appreciated that the various systems may include additional devices, components, modules, etc., and/or may not include all of the devices, components, modules, etc. discussed in connection with the figures. In addition, combinations of these schemes can also be used.

In addition, in the embodiments of the present application, words such as "exemplary" and "for example" are used to represent examples, illustrations, or explanations. Any embodiment or design described in this application as an "example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of the word "example" is intended to present a concept in a concrete way.

In the embodiments of the present application, "relevant" and "corresponding" may sometimes be used interchangeably. It should be noted that, when the difference is not emphasized, the meanings to be expressed are the same.

In the embodiments of the present application, a subscripted symbol such as W₁ may sometimes be written in a non-subscript form such as W1. When the difference is not emphasized, the meaning to be expressed is the same.

The network architecture and service scenarios described in the embodiments of the present application are intended to illustrate the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application. With the evolution of the network architecture and the emergence of new service scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

References in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise. The terms "including", "comprising", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.

In this application, "at least one" means one or more, and "plurality" means two or more. "And/or" describes the relationship between associated objects and indicates that three kinds of relationships are possible; for example, "A and/or B" can indicate that A exists alone, that A and B exist at the same time, or that B exists alone, where A and B can each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or multiple items. For example, at least one of a, b, or c can represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c can each be single or multiple.

Below, the related technology of the present application is introduced:

For ease of understanding, before describing the data verification method provided by the present application, related terms involved in the present application are briefly introduced first.

1. Data verification

Data verification is a verification operation performed to ensure the integrity of data. A check value is usually calculated on the original data by a specified algorithm. The receiver then calculates the check value again using the same algorithm. If the check values obtained by the two calculations are the same, the data is consistent.

2. Data replication

Data replication is the technique of copying data from one location to another. It involves sharing information to ensure consistency between redundant resources (such as software or hardware components) in order to improve reliability, fault tolerance, or accessibility.

3. Merkle tree

Merkle tree can also be called hash tree. A Merkle tree is a binary tree consisting of a root node, a set of intermediate nodes and a set of leaf nodes. The bottommost leaf node contains the stored data or its hash value, each intermediate node is the hash value of the content of its two child nodes, and the root node is also composed of the hash value of the content of its two child nodes. When new data is generated in the data extraction module, the respective Merkle trees will be dynamically updated.

4. Hash

A hash is a function that maps data of arbitrary length to data of fixed length. A slight change in the input data can change the result of the hash operation beyond recognition, and it is generally considered infeasible to infer the characteristics of the original input data from the hash value.
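These two properties (fixed-length output, and a large change in output for a small change in input) can be observed with a few lines of Python. SHA-256 from the standard `hashlib` module is used here purely as an example of a hash function; the present application does not name a specific algorithm.

```python
import hashlib

# Two inputs that differ only in their last character.
a = hashlib.sha256(b"row data 1").hexdigest()
b = hashlib.sha256(b"row data 2").hexdigest()

# A much longer input still produces a digest of the same length.
c = hashlib.sha256(b"x" * 10_000).hexdigest()

# Count how many of the 256 output bits differ between the first two digests.
differing_bits = bin(int(a, 16) ^ int(b, 16)).count("1")
```

All three digests are 64 hexadecimal characters (256 bits) long, and `a` and `b` typically differ in roughly half of their bits even though the inputs differ by a single character.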

5. Heterogeneous database (HDB)

A heterogeneous database is a collection of multiple related database systems that can realize data sharing and transparent access. Each database system already exists before joining the heterogeneous database system and has its own database management system (DBMS). Each component of a heterogeneous database has its own autonomy. While realizing data sharing, each database system still maintains its own application characteristics, integrity control, and security control.

6. Homogeneous database

A homogeneous database is one in which all sites use common DBMS software, and the sites are aware of each other and cooperate to handle user requests.

7. Relational database (RD)

An RD is a database that uses a relational model to organize data. It stores data in the form of rows and columns so that users can understand it easily. A series of rows and columns in a relational database is called a table, and a group of tables constitutes a database. A user retrieves data from the database through a query, which is executable code that selects certain areas of the database. The relational model can be understood simply as a two-dimensional table model, and a relational database is a data organization composed of two-dimensional tables and the relationships between them.

8. Non-relational database (Not Only SQL, NoSQL)

NoSQL is a database that uses a non-relational model to organize data. Non-relational databases can include the following types: key-value store databases (e.g., Oracle BDB), column store databases (e.g., HBase), document databases (e.g., CouchDB or MongoDB), and graph databases.

Usually, in the process of synchronizing data between the source database and the target database, the data to be synchronized is transmitted over a physical medium or a network, so there is a certain delay between the source database and the target database. In addition, many uncertain factors can arise during the transmission of the data to be synchronized, such as hardware failures, software defects, human errors, and environmental interference, which may affect the reliability of the data to be synchronized (for example, by causing data loss or data errors). These factors may cause inconsistency between the source database and the target database. Therefore, after data synchronization is completed, a consistency check needs to be performed on the synchronized data stored in the target database, so as to ensure the reliability of the data stored in the target database. At present, an offline data verification method is usually adopted: the data generated by the operation of application software is obtained from the production database (i.e., the source database) of each independent data source and placed in a unified offline database (i.e., the target database), and the data is then checked to determine whether the data of each independent data source is consistent. However, since acquiring data from the production database consumes database resources, the data needs to be acquired less frequently, or even only when the business volume is low, in order to reduce the impact on database performance, which in turn affects the timeliness of offline data verification. In addition, in the above data verification process, the data usually needs to be sorted, and data sorting often occupies a large amount of system resources. Therefore, the above offline data verification method usually cannot meet business requirements when verifying massive data (for example, TB-level data).

The present application provides a data verification method, device and system, which can better meet the requirements for accurate and rapid verification of massive data.

In order to facilitate understanding, first, with reference to FIG. 1 and FIG. 2 , a system and a data verification device suitable for the data verification method provided by the present application will be introduced in detail.

As shown in FIG. 1, the system 100 can be used in, but is not limited to, the following scenarios: a database data migration scenario or a data synchronization scenario. The system 100 may include at least one source database 110, at least one target database 120, and at least one data verification device 130. The data verification device 130 is a system on a third-party hardware device independent of the source database 110 and the target database 120, the source database 110 is the database before data migration or replication, and the target database 120 is the database after data migration or replication.

In this application, the type of the above-mentioned source database 110 and the type of the above-mentioned target database 120 are not specifically limited.

In one example, the above-mentioned source database 110 or the above-mentioned target database 120 may be a relational database. For example, the source database 110 or the target database 120 may be any one of the following relational databases: Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL. It should be understood that the type of relational database here is only illustrative and does not constitute any limitation to the system 100 . For example, the relational database may also be other types of relational databases other than those listed above.

In another example, the above-mentioned source-side database 110 or the above-mentioned target-side database 120 may be a non-relational database. For example, the above-mentioned source database 110 or the above-mentioned target database 120 may be any one of the following non-relational databases: NoSQL, Cloudant and MongoDB. It should be understood that the type of the non-relational database here is only illustrative and does not constitute any limitation to the system 100 . For example, the non-relational database may also be other types of non-relational databases other than those listed above.

In yet another example, the source database 110 may be a non-relational database, and the target database 120 may be a relational database. For example, the source database 110 may be a NoSQL database, and the target database 120 may be an Oracle database.

In this application, the source database 110 and the target database 120 may be homogeneous databases. The source database 110 and the target database 120 may also be heterogeneous databases, which are not limited.

In this application, the deployment of the source database 110, the target database 120 and the data verification device 130 on equipment is not specifically limited, as long as the data verification device 130 is a system independent of the source database 110 and the target database 120.

In one example, the source database 110 may be a physical module or a virtual module deployed on physical device #1, the target database 120 may be a physical module or a virtual module deployed on physical device #2, and the data verification apparatus 130 may be a physical module or a virtual module deployed on physical device #3, where physical device #1, physical device #2, and physical device #3 are different devices.

In another example, the source database 110 and the target database 120 may be different physical modules or virtual modules deployed on physical device #1, and the data verification apparatus 130 may be a physical module or a virtual module deployed on physical device #2, where physical device #1 and physical device #2 are different devices.

Referring to FIG. 1, the source database 110 and the target database 120 may interact (e.g., for data migration or data synchronization), and the source database 110 and the target database 120 may each also interact with the data verification apparatus 130. After the data in the source database 110 is synchronized or migrated to the target database 120, the data verification device 130 can extract the source-side data to be verified from the source database 110, extract the target-side data to be verified from the target database 120, and perform a consistency check on the two sets of extracted data to verify whether the data is consistent between the source database 110 and the target database 120 after data migration or data synchronization. When the data verification device 130 determines that the data to be verified extracted from the source database is inconsistent with the data to be verified extracted from the target database, it can further determine which data is inconsistent. When the data verification apparatus 130 has a storage function, the result of the consistency check may also be stored in the data verification apparatus 130.

It should be understood that FIG. 1 is for illustration only and does not constitute any limitation to the system to which the present application applies. For example, the system 100 may further include a larger number of source-end databases 110 and/or target-end databases 120 and/or data verification devices 130 . For example, the data verification apparatus 130 may further include other modules, such as a verification execution module, a source-side data management module to be verified, a target-side data management module to be verified, and the like.

Below, with reference to FIG. 2 , a schematic structural diagram of the data verification apparatus 130 in FIG. 1 provided in the present application will be introduced.

As shown in FIG. 2 , the apparatus 130 may include: a source data extraction module 131 , a source processing module 132 , a target data processing module 133 , a target data extraction module 134 , a comparison module 135 and a storage module 136 . Wherein, the above-mentioned modules may be connected through internal connection paths. For example, the source-side processing module 132 may interact with the comparison module 135 , the source-side data extraction module 131 , and the target-side processing module 133 .

The source-end data extraction module 131 is configured to obtain data from a source-end database (for example, the above-mentioned source-end database 110 ). For example, the source data extraction module 131 can obtain data from the source database 110 in FIG. 1 .

The source-end processing module 132 is configured to acquire data from the source-end data extraction module 131, and perform hash processing and data partition processing on the acquired data.

The target-end processing module 133 is configured to obtain data from the target-end data extraction module 134, and perform hash processing and data partition processing on the obtained data.

The target-end data extraction module 134 is configured to obtain data from a target-end database (eg, the above-mentioned target-end database 120 ). For example, the target data extraction module 134 can obtain data from the target database 120 in FIG. 1 .

The comparison module 135 is configured to acquire the Merkle trees corresponding to the data from the source-end processing module 132 and the target-end processing module 133, and to perform data consistency verification based on the acquired Merkle trees. The comparison module 135 is the core module of the above-mentioned data verification device 130. Specifically, the comparison module 135 may further include a data comparison sub-module and a data reverse check sub-module. The data comparison sub-module can use the Merkle trees to quickly compare the data and find the set of row identifiers of the inconsistent rows. The data reverse check sub-module can then look up the detailed data in the databases according to those row identifiers and finally find the data values corresponding to the inconsistent row identifiers.
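One common way to locate inconsistent leaves with two Merkle trees of identical shape is to descend from the root and visit only the subtrees whose hash values differ. The following Python sketch illustrates that idea under stated assumptions: the tree is stored as a list of levels, the function name `differing_leaves` is hypothetical, and the integer values are placeholders rather than real hash values. It is not presented as the comparison procedure of the data comparison sub-module.

```python
from typing import List

Tree = List[List[int]]  # levels[0] = leaf hashes, levels[-1] = [root hash]


def differing_leaves(tree_a: Tree, tree_b: Tree) -> List[int]:
    """Return 0-based indices of leaves whose hashes differ.

    Descends from the root and only visits subtrees whose hashes differ,
    so identical subtrees are skipped without inspecting their leaves.
    """
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []  # equal roots: the two tables are treated as consistent
    suspects = [0]  # node indices at the current level whose hashes differ
    for level in range(top, 0, -1):
        next_suspects = []
        for node in suspects:
            for child in (2 * node, 2 * node + 1):
                if tree_a[level - 1][child] != tree_b[level - 1][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects


# Placeholder values only; they are not real hash values.
source_tree = [[1, 2, 3, 4], [12, 34], [1234]]
target_tree = [[1, 2, 9, 4], [12, 94], [1294]]
print(differing_leaves(source_tree, target_tree))
```

The leaf indices returned by such a comparison identify the hash buckets to inspect; a reverse check would then only need to fetch the rows whose row identifiers fall in those buckets.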

The storage module 136 is used to store data and instructions.

It should be understood that FIG. 2 is only for illustration and does not constitute any limitation to the data verification apparatus 130 provided in the present application. For example, the source processing module 132 and the target processing module 133 in the data verification apparatus 130 may also be included in the same processing module. For example, the source data extraction module 131 and the target data extraction module 134 in the data verification apparatus 130 may also be included in the same processing module. For example, when the comparison module 135 in the data verification apparatus 130 has the function of the storage module 136 , the data verification apparatus 130 may also not include the storage module 136 .

Below, the data verification method provided by the present application will be described in detail with reference to FIG. 3 to FIG. 8 .

As shown in FIG. 3 , the method 100 may include steps 110 to 130 . Steps 110 to 130 will be described in detail below. The execution subject of steps 110 to 130 may be the data verification device 130 shown in FIG. 2 .

Step 110: Process the first data table in the first database to generate a first Merkle tree.

The first database can be understood as the production database, that is, the source database. For example, in one example, the first database may be the source database 110 shown in FIG. 1.

In this application, the data source and data size included in the first data table are not limited.

In one example, the first data table may include the full amount of data in at least one data table in the first database.

Optionally, the first data table may further include two or even more data tables in the first database. In this case, the first data table can be understood as a data set composed of two or more data tables in the first database.

For example, when the first database includes data table #1, data table #2 and data table #3, the first data table may include all data in data table #1. The first data table may also include all the data in data table #1 and data table #3. The first data table may also include all data in data table #1, data table #2, and data table #3.

In another example, the first data table may include incremental data in at least one data table in the first database. That is to say, the data verification method provided by the present application can also perform consistency verification only on the changed data in the data table.

For example, at time #1, data table #1 and data table #2 are consistent, wherein data table #2 is obtained by copying data table #1. After time #1, part of the data in data table #1 is changed (eg, data is updated, data is increased, or data is decreased, etc.). In this case, the changed data of the above-mentioned data table #1 can be considered as the data included in the first data table.

In this application, each row of the first data table may include a row identifier and row data. The first Merkle tree includes N first leaf nodes, and the N first leaf nodes correspond one-to-one to N first hash buckets. The hash value of each first leaf node is determined according to the corresponding first hash bucket, and the N first hash buckets are obtained by hash partitioning the first data table according to the row identifiers. Any two first hash buckets are different, and N is a positive integer greater than or equal to 2.

In this application, the types of row identifiers included in the first data table are not specifically limited.

In one example, the row ID may be a numeric row ID. For example, a numeric row ID can be "5".

In another example, the above row identifier may also be a string type row identifier. For example, the string-type row identifier can be "Zhang San" or "Li Si", etc.

When the above row identifier is a string type row identifier, before constructing the Merkle tree, the string type row identifier needs to be processed to obtain a hash value corresponding to the string type row identifier.

In this application, the statement that the N first hash buckets are obtained by hash partitioning the first data table according to row identifiers can be understood as follows. Regardless of whether the row identifiers of the first data table are numeric row identifiers or string row identifiers, when the data in the first data table is mapped to the hash buckets, a hash operation may first be performed on each row identifier included in the first data table to obtain the hash value corresponding to that row identifier. A modulo operation is then performed on that hash value, and the hash bucket corresponding to the row in which the row identifier is located is determined according to the result of the modulo operation.
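The hash-then-modulo mapping described above can be sketched as follows. This is a minimal illustration under assumptions that the application does not specify: SHA-256 is used as the hash function, only the first 8 bytes of the digest are used for the modulo operation, and a remainder r is numbered as bucket r + 1. The function name `bucket_serial` is hypothetical.

```python
import hashlib


def bucket_serial(row_id, n_buckets: int) -> int:
    """Map a row identifier to a hash bucket serial number in 1..n_buckets."""
    # Numeric and string identifiers are both converted to bytes before hashing.
    digest = hashlib.sha256(str(row_id).encode("utf-8")).digest()
    id_hash = int.from_bytes(digest[:8], "big")  # 64-bit value taken from the digest
    # Assumed numbering: remainder r is stored in the bucket with serial number r + 1.
    return id_hash % n_buckets + 1


for row_id in (5, "Zhang San", "Li Si"):
    print(row_id, "->", bucket_serial(row_id, 4))
```

Because the mapping depends only on the row identifier and on N, applying the same function to another table sends a row with the same identifier to the bucket with the same serial number.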

The statement that the N first leaf nodes correspond one-to-one to the N first hash buckets can be understood as follows: the i-th first leaf node (that is, the first leaf node with serial number i) among the N first leaf nodes corresponds to the i-th first hash bucket (that is, the first hash bucket with serial number i) among the N first hash buckets. That is to say, the first leaf node with serial number i corresponds to the first hash bucket with serial number i. The serial numbers of the N first leaf nodes are all different, the serial numbers of the N first hash buckets are all different, and i is a positive integer greater than or equal to 1 and less than or equal to N.

The above N first hash buckets are obtained by hash partitioning the first data table according to row identifiers, and the corresponding relationship between the first hash bucket and the first data table is not specifically limited in this application.

In one example, each first hash bucket is determined from a row in the first data table. At this time, each first hash bucket corresponds to a row of the first data table. In this case, the number of row identifiers in the first data table included in each first hash bucket is the same.

In another example, at least one of the N first hash buckets is determined from two or more rows in the first data table. At this time, at least one first hash bucket corresponds to two or more rows of the first data table. In this case, the number of row identifiers in the first data table included in each first hash bucket may be different.

Optionally, at least one of the N first hash buckets may also be empty. For example, N-1 of the first hash buckets are determined according to all the row data included in the first data table, and the remaining first hash bucket does not include any data of the first data table.

The statement that any two first hash buckets are different can be understood as follows: the serial numbers of any two first hash buckets are different, and the row identifiers of the first data table included in any two first hash buckets are different.

For example, the serial number of hash bucket #1 is 1, and hash bucket #1 includes 2 row identifiers in the first data table, which are "5" and "6" respectively, and the serial number of hash bucket #2 is 2, And the hash bucket #2 includes 1 row identifier in the first data table, which is "1". In this case, hash bucket #1 can be considered to be different from hash bucket #2.

In one example, the first data table may include M rows, where M is a positive integer greater than or equal to N. In this case, the above-mentioned processing of the first data table in the first database to generate the first Merkle tree may include the following steps:

Hash the M rows to obtain M first hash groups, where the M first hash groups are in one-to-one correspondence with the M rows, each first hash group includes the row identifier of one of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the first hash groups are all different;

Mapping the M first hash groups to the N first hash buckets;

Generate the first Merkle tree according to the hash values of the N first leaf nodes.

In this application, the number of first hash groups included in each of the N first hash buckets is not specifically limited.

For example, the number of first hash groups included in each of the N first hash buckets may be the same. Alternatively, the number of first hash groups included in each of the N first hash buckets may be different. As a further alternative, some of the N first hash buckets may include the same number of first hash groups, while the remaining first hash buckets include different numbers of first hash groups.

In this application, the hash value of each first leaf node may be a hash value obtained by performing an XOR operation on the hash values included in the first hash groups contained in the corresponding first hash bucket. It should be understood that when the first hash bucket corresponding to a first leaf node does not include any of the M first hash groups, the hash value of that first leaf node may be empty.

The above determination of the hash values of the N first leaf nodes according to the N first hash buckets may include:

When the first hash bucket corresponding to at least one first leaf node among the N first leaf nodes includes at least one first hash group among the M first hash groups, the hash value of the at least one first leaf node is a hash value obtained by performing an XOR operation on the hash values included in the at least one first hash group contained in the corresponding first hash bucket.

Optionally, the first hash bucket corresponding to the at least one first leaf node may further include two or more first hash groups among the M first hash groups.

When the first hash bucket corresponding to at least one first leaf node among the N first leaf nodes does not include any first hash group among the M first hash groups, the hash value of the at least one first leaf node is equal to zero.
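The leaf-node rule above (XOR of the row-data hash values in the bucket, and zero for an empty bucket) can be illustrated with a short Python sketch. The type alias `HashGroup`, the function name `leaf_hash`, and the small hash values are hypothetical and chosen only for readability.

```python
from functools import reduce
from typing import List, Tuple

HashGroup = Tuple[int, int]  # (row identifier, hash value of the row data)


def leaf_hash(bucket: List[HashGroup]) -> int:
    """XOR the row-data hash values in one bucket; an empty bucket yields 0."""
    return reduce(lambda acc, group: acc ^ group[1], bucket, 0)


# Placeholder hash values, not taken from the figures.
bucket = [(1, 0x0F), (5, 0xF0)]
print(hex(leaf_hash(bucket)))
print(leaf_hash([]))
```

Since XOR is commutative and associative, the leaf value does not depend on the order in which the hash groups arrive in the bucket, so the rows do not need to be sorted before the leaf value is computed.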

The height of the first Merkle tree described above is associated with the first data table. Before the first Merkle tree is established, the relevant parameters of the first Merkle tree, for example, the number of leaf nodes included in the first Merkle tree and the tree height, also need to be determined according to the size of the first data table. The tree height of the first Merkle tree changes adaptively with the amount of data included in the first data table: the larger the amount of data included in the first data table, the higher the first Merkle tree. In other words, the first Merkle tree is higher when the first data table includes a relatively large amount of data (for example, 1 GB) than when the first data table includes a relatively small amount of data (for example, 100 MB).
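The application does not give a formula for choosing these parameters. Purely as an illustration, the sketch below assumes a complete binary tree whose leaf count N is a power of two, in which case the tree has log2(N) + 1 levels, and assumes a hypothetical target of `rows_per_bucket` rows per hash bucket. Both function names and the default value 1000 are assumptions.

```python
def choose_leaf_count(row_count: int, rows_per_bucket: int = 1000) -> int:
    """Smallest power of two that is >= row_count / rows_per_bucket, and at least 2."""
    needed = max(2, -(-row_count // rows_per_bucket))  # ceiling division
    n = 2
    while n < needed:
        n *= 2
    return n


def tree_height(leaf_count: int) -> int:
    """Number of levels in a complete binary tree; leaf_count must be a power of two."""
    return leaf_count.bit_length()  # 4 leaves -> 3 levels (leaf, intermediate, root)


for rows in (8_000, 1_000_000, 10_000_000):
    n = choose_leaf_count(rows)
    print(rows, "rows ->", n, "leaf nodes, tree height", tree_height(n))
```

Under these assumptions a larger table yields more leaf nodes and therefore a higher tree, which matches the adaptive behaviour described above.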

Next, taking the Merkle tree shown in FIG. 4 as an example, the above-mentioned method for generating the first Merkle tree according to the hash value of the first leaf node is introduced.

FIG. 4 shows a Merkle tree (i.e., an example of the above-mentioned first Merkle tree) whose tree height is 3, with 2 intermediate nodes and 4 leaf nodes. The topmost layer is the root node, the second layer is the intermediate nodes, the next layer is the leaf nodes, and the bottommost layer is the hash buckets described above (i.e., examples of the first hash buckets above).

For ease of description, from left to right, the four leaf nodes of the Merkle tree (ie, an example of the first leaf node above) may be respectively marked as: leaf node 1, leaf node 2, leaf node 3 and leaf node 4. From left to right, the four hash buckets of the Merkle tree can be marked as: hash bucket 1, hash bucket 2, hash bucket 3, and hash bucket 4. Among them, the four leaf nodes of the Merkle tree correspond to the four hash buckets one-to-one. Specifically, leaf node 1 corresponds to hash bucket 1, leaf node 2 corresponds to hash bucket 2, leaf node 3 corresponds to hash bucket 3, and leaf node 4 corresponds to hash bucket 4.

The hash value of each leaf node of the Merkle tree is obtained by performing an XOR operation on the hash values included in the hash bucket corresponding to that leaf node. For example, the hash value of leaf node 1 is obtained by an XOR operation on the hash values included in hash bucket 1, that is, the hash value of leaf node 1 can be expressed as N0=XOR(1,5), where XOR(1,5) indicates the result of performing the XOR operation on the hash value with the row ID of 1 (ie, 0xeffe898) and the hash value with the row ID of 5 (ie, 0xb8b8dd) included in hash bucket 1.

The hash value of each intermediate node of the Merkle tree is obtained by hashing the hash values of its two child nodes. For example, N4=H(N0, N1) represents the hash value of an intermediate node of the Merkle tree, where H(N0, N1) is the result of performing a hash operation on the hash values of the two leaf nodes under this intermediate node (i.e., N0 and N1).

The hash value of the root node of the Merkle tree is obtained by hashing the hash values of its two child nodes. For example, H(N4,N5) represents the hash value of the root node of the Merkle tree.
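The bottom-up construction described for FIG. 4 can be sketched as follows, assuming that the leaf hash values have already been computed. The choice of SHA-256 for H, the 32-byte encoding of each child value, and the names `combine` and `build_tree` are assumptions made for this illustration; the application does not fix the hash function.

```python
import hashlib
from typing import List


def combine(left: int, right: int) -> int:
    """H(left, right): hash two child values into one parent value."""
    data = left.to_bytes(32, "big") + right.to_bytes(32, "big")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def build_tree(leaf_hashes: List[int]) -> List[List[int]]:
    """Return the tree as levels: levels[0] = leaves, levels[-1] = [root]."""
    count = len(leaf_hashes)
    if count < 2 or count & (count - 1):
        raise ValueError("leaf count must be a power of two, at least 2")
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        # Each parent is H(left child, right child), e.g. N4 = H(N0, N1).
        levels.append([combine(below[i], below[i + 1]) for i in range(0, len(below), 2)])
    return levels


# Four placeholder leaf values N0..N3 give a tree of height 3, as in FIG. 4.
levels = build_tree([0x11, 0x22, 0x33, 0x44])
n4, n5 = levels[1]
root = levels[2][0]
print(len(levels), root == combine(n4, n5))
```

Two tables that produce the same leaf values under the same construction produce the same root value, so a single comparison of the root values is enough to detect whether any leaf differs.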

The i-th leaf node #1 above can be understood as the leaf node #1 with serial number i, and the i-th hash bucket #1 can be understood as the hash bucket #1 with serial number i, where i = 1, 2, 3, 4.

It should be understood that FIG. 4 is for illustration only and does not constitute any limitation to the present application. For example, in some implementations, the Merkle tree shown in FIG. 4 may also include a greater number of leaf nodes. For example, in some implementations, the Merkle tree shown in FIG. 4 may also have a higher tree height.

Before step 110, the following operation may also be included: acquiring the first data table from the first database.

For ease of description, a method for acquiring the first data table from the first database will be specifically described below with reference to FIG. 5 and FIG. 6 . It should be understood that FIG. 5 and FIG. 6 are for illustration only, and do not constitute any limitation to the method for obtaining the first data table in the present application.

FIG. 5 is a schematic diagram of extracting data from the first data table provided by the present application. It should be understood that FIG. 5 is only an example. For example, a greater number (eg, 100 rows) or a lesser number (eg, 4 rows) of row data may also be included in data table #1. For example, the data extraction module may also include a higher number of threads.

As shown in FIG. 5 , the execution body for extracting data from the first data table may be a data extraction module. Specifically, the data extraction module may be the source-end data extraction module 131 and the target-end data extraction module 134 shown in FIG. 2 . That is, the source-side data extraction module 131 and the target-side data extraction module 134 in FIG. 2 have the data extraction function described below.

In one example, extracting data from the first data table may include, but is not limited to, the following steps:

Divide the first data table into S batches of data by row, where S is a positive integer greater than or equal to 1;

Use S processing threads to process S batches of data, and S processing threads correspond to S batches of data one-to-one;

The processed S batches of data are enqueued into S queues, and the S batches of data are in one-to-one correspondence with the S queues.

The number of the above processing threads may be set according to the size of the first data table. For example, when the first data table is larger, a larger number of processing threads may be set. For example, when the first data table is smaller, a smaller number of processing threads may be provided.

In the above technical solution, row data can be extracted from the first data table in batches, each batch of data can be processed by a separate thread, these threads can be executed in parallel, and the extracted data is put into the corresponding data queue .

Optionally, the same processing thread can also be used to process the data in the data table.

In an example, the above-mentioned data extraction module may be the source-side data extraction module 131 in FIG. 2, or it may be the target-side data extraction module 134 in FIG. 2. That is to say, the source-side data extraction module 131 and the target-side data extraction module 134 in FIG. 2 have the functions of the above-mentioned data extraction module.

Referring to FIG. 5, there are 8 pieces of data in data table #1 (i.e., an example of the first data table). The first to fourth pieces of data can be regarded as the first batch of data, and the fifth to eighth pieces of data can be regarded as the second batch of data. The data extraction module may include two threads responsible for extracting data, namely thread #1 and thread #2. Thread #1 may be responsible for extracting the first batch of data, and thread #2 may be responsible for extracting the second batch of data. Thread #1 puts the extracted data into queue #1, and thread #2 puts the extracted data into queue #2. The data extraction threads, that is, thread #1 and thread #2, can be executed in parallel to improve extraction efficiency.
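The batch extraction of FIG. 5 can be sketched with Python's standard `threading` and `queue` modules. In this illustration the table is an in-memory list standing in for a database query, the batches are contiguous and of equal size, and the names `Row` and `extract_in_batches` are hypothetical.

```python
import queue
import threading
from typing import List, Tuple

Row = Tuple[int, str]  # (row identifier, row data)


def extract_in_batches(rows: List[Row], batch_count: int) -> List[queue.Queue]:
    """Split rows into batch_count contiguous batches, one thread and one queue each."""
    size = -(-len(rows) // batch_count)  # ceiling division
    batches = [rows[i * size:(i + 1) * size] for i in range(batch_count)]
    queues = [queue.Queue() for _ in range(batch_count)]

    def worker(batch: List[Row], out: queue.Queue) -> None:
        for row in batch:  # stands in for reading the rows from the database
            out.put(row)

    threads = [threading.Thread(target=worker, args=(b, q)) for b, q in zip(batches, queues)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queues


table = [(i, f"data{i}") for i in range(1, 9)]  # 8 rows, as in data table #1
queue_1, queue_2 = extract_in_batches(table, 2)
print(queue_1.qsize(), queue_2.qsize())
```

With 8 rows and 2 batches, rows 1 to 4 end up in the first queue and rows 5 to 8 in the second, each filled by its own thread.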

FIG. 6 is a schematic diagram of hash partitioning data extracted from the first data table according to row identifiers provided by the present application.

As shown in FIG. 6 , the execution subject for hash partitioning the data extracted from the data table #1 (ie, an example of the first data table) according to row identifiers may be a data processing module. Specifically, the data processing module may be the source-end processing module 132 and the target-end data processing module 133 shown in FIG. 2 . That is, the source-side processing module 132 and the target-side data processing module 133 in FIG. 2 have the hash partitioning function described below. Referring to FIG. 6 , data table #1 includes 8 pieces of data, and the row identifiers corresponding to these 8 pieces of data are 1, 2, 3, . . . , 8 respectively.

In the embodiment of the present application, the process of hash partitioning data table #1 according to row identifiers, and mapping the result after hash partitioning to N (N=4) hash buckets #1 may be as follows:

First, hash the row data of each row in data table #1 to obtain the hash value corresponding to the row data, and then enqueue each row-data hash value together with the corresponding row identifier, in order, into hash data queue #1. For convenience of description, each row in the hash data queue #1 may be recorded as one hash group (i.e., an example of the above-mentioned first hash group). For example, the row ID of the first hash group in hash data queue #1 is 1, and the stored hash value is 0xffe898. The row ID of the 5th hash group in hash data queue #1 is 5, and the stored hash value is 0xb8bdd. For details, refer to FIG. 6; the hash groups are not listed one by one here.

Optionally, in some implementation manners, a hash operation may also be performed on the row identifier and row data of each row of data to obtain a hash value corresponding to the row identifier and a hash value corresponding to the row data.

Then, a modulo operation is performed on each row identifier in the hash data queue #1, and the hash bucket corresponding to the row in which the row identifier is located is determined according to the result of the modulo operation. Specifically, the hash data queue #1 includes 8 hash groups. The row identifier of the first hash group is 1, and 1 modulo 4 is 1; the row identifier of the fifth hash group is 5, and 5 modulo 4 is also 1. Therefore, the first hash group and the fifth hash group are mapped to the first hash bucket #1 (that is, the hash bucket with serial number 1). That is to say, the row data with row identifiers 1 and 5 in data table #1 is mapped into the hash bucket #1 with serial number 1. Similarly, applying the same processing to the other hash groups in the hash data queue #1 gives the following result: the 2nd hash group and the 6th hash group are mapped to the 2nd hash bucket #1, the 3rd hash group and the 7th hash group are mapped to the 3rd hash bucket #1, and the 4th hash group and the 8th hash group are mapped to the 4th hash bucket #1.
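The mapping in the FIG. 6 example can be reproduced with the following sketch. The text does not state how a remainder of 0 is numbered; since rows 4 and 8 land in the 4th hash bucket, the sketch assumes that a remainder of 0 corresponds to the bucket with serial number N. The name `map_to_buckets` and the placeholder hash values are hypothetical.

```python
from typing import Dict, List, Tuple

HashGroup = Tuple[int, int]  # (row identifier, hash value of the row data)


def map_to_buckets(groups: List[HashGroup], n_buckets: int) -> Dict[int, List[HashGroup]]:
    """Assign each hash group to a bucket numbered 1..n_buckets by row ID modulo."""
    buckets: Dict[int, List[HashGroup]] = {k: [] for k in range(1, n_buckets + 1)}
    for row_id, data_hash in groups:
        remainder = row_id % n_buckets
        # Assumption: remainder 0 is stored in bucket n_buckets (rows 4 and 8 -> bucket 4).
        serial = remainder if remainder != 0 else n_buckets
        buckets[serial].append((row_id, data_hash))
    return buckets


# Eight hash groups with placeholder hash values (not the values shown in FIG. 6).
queue_groups = [(row_id, 0x100 + row_id) for row_id in range(1, 9)]
for serial, members in map_to_buckets(queue_groups, 4).items():
    print("hash bucket", serial, "holds row identifiers", [row_id for row_id, _ in members])
```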

It should be understood that FIG. 6 is only an example. For example, a greater number (eg, 8) or a lesser number (eg, 2) of hash bucket #1 may also be included. For example, a greater number (eg, 100 rows) or a lesser number (eg, 4 rows) of row data may also be included in data table #1.

Step 120: Process the second data table in the second database to generate a second Merkle tree.

The second database can be understood as the target database. For example, in one example, the second database may be the target database 120 shown in FIG. 1 .

In this application, the second data table is obtained by synchronizing or migrating the first data table to the second database. The second Merkle tree includes N second leaf nodes, and the N second leaf nodes correspond one-to-one to N second hash buckets. The hash value of each second leaf node is determined according to the corresponding second hash bucket, and the N second hash buckets are obtained by hash partitioning the second data table according to row identifiers. The hash rule used to hash partition the second data table according to row identifiers to obtain the N second hash buckets is the same as the hash rule used to hash partition the first data table according to row identifiers to obtain the N first hash buckets, and any two second hash buckets are different.

In this application, the second Merkle tree includes N second leaf nodes, and the first Merkle tree includes N first leaf nodes. Since the two Merkle trees include the same number of leaf nodes, the first Merkle tree and the second Merkle tree provided by the present application have the same tree height.

The statement that the N second leaf nodes correspond one-to-one to the N second hash buckets can be understood as follows: the i-th second leaf node (that is, the second leaf node with serial number i) among the N second leaf nodes corresponds to the i-th second hash bucket (that is, the second hash bucket with serial number i) among the N second hash buckets. That is to say, the second leaf node with serial number i corresponds to the second hash bucket with serial number i. The serial numbers of the N second leaf nodes are all different, the serial numbers of the N second hash buckets are all different, and i is a positive integer greater than or equal to 1 and less than or equal to N.

The hash value of each second leaf node is determined according to the corresponding second hash bucket. For the specific determination method, refer to the method in step 110 for determining the hash value of a first leaf node according to the corresponding first hash bucket.

The above-mentioned N second hash buckets are obtained by hash partitioning the second data table according to row identifiers, and the corresponding relationship between the second hash bucket and the second data table is not specifically limited in this application.

In one example, each second hash bucket is determined from a row in the second data table.

In another example, at least one of the N second hash buckets is determined from two or more rows in the second data table.

In yet another example, at least one second hash bucket among the above N second hash buckets may also be empty.

As noted above, the i-th second leaf node among the N second leaf nodes corresponds to the i-th second hash bucket among the N second hash buckets, where i is a positive integer greater than or equal to 1 and less than or equal to N. It should also be understood that the serial numbers of the N second leaf nodes are all different.

The above-mentioned second data table is obtained by synchronizing or migrating the first data table to the second database, and may include the following situations:

If there is no data missing during the process of synchronizing or migrating the first data table to the second database, the number of rows included in the second data table is the same as the number of rows included in the first data table.

For example, the first data table includes 10 rows, and each row includes a row identifier and row data. If no data is missing while the first data table is synchronized or migrated to the second database, the second data table obtained after synchronization or migration also includes 10 rows.

If there is data missing in the process of synchronizing or migrating the first data table to the second database, the number of rows included in the second data table is less than the number of rows included in the first data table.

For example, the first data table includes 10 rows, and each row includes a row identifier and row data. If one row of data is missing while the first data table is synchronized or migrated to the second database, the second data table obtained after synchronization or migration includes 9 rows.

That is to say, the number of rows included in the second data table in this application may be the same as the number of rows included in the first data table, or may be smaller than the number of rows included in the first data table.

The hashing rule for hash partitioning the second data table according to row identifiers to obtain the N second hash buckets is the same as the hashing rule for hash partitioning the first data table according to row identifiers to obtain the N first hash buckets. It can be understood that, because the two hashing rules are the same, the data corresponding to the same row identifier in the second data table and in the first data table is guaranteed to be mapped to hash buckets with the same sequence number.

That any two second hash buckets are different can be understood as follows: the sequence numbers corresponding to any two second hash buckets are different, and the row identifiers included in any two non-empty second hash buckets are different.

In this application, when the second data table includes K rows and the first data table includes M rows, where K is a positive integer less than or equal to M, processing the second data table in the second database to generate the second Merkle tree may include:

Hash the K rows to obtain K second hash groups, where the K second hash groups are in one-to-one correspondence with the K rows, each second hash group includes one row identifier of the K rows and the hash value of the row data corresponding to that row identifier, and the row identifiers included in different second hash groups are different;

Map the K second hash groups to the N second hash buckets;

Generate the second Merkle tree according to the hash values of the N second leaf nodes.

It should be understood that the mapping rule for mapping the K second hash groups to the N second hash buckets is the same as the mapping rule for mapping the M first hash groups to the N first hash buckets. That is, the hash rule for obtaining the N second hash buckets by hash partitioning the second data table according to row identifiers is the same as the hash rule for obtaining the N first hash buckets by hash partitioning the first data table according to row identifiers.

For example, the first hash group with sequence number 1 includes the row identifier 1 and the hash value of the corresponding row data, and this first hash group is mapped to the first hash bucket with sequence number 5. When the second data table includes a row with row identifier 1, that row corresponds to the second hash group with sequence number 1, and this second hash group is mapped to the second hash bucket with sequence number 5.

It should also be understood that when the second hash bucket corresponding to a second leaf node is empty, the hash value of that second leaf node is also empty.
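As a non-limiting illustration, the tree generation described above may be sketched as follows. The sketch assumes SHA-256 as the hash function, a table represented as a mapping from row identifier to row data, and a number of buckets N that is a power of two; none of these choices is specified by this application, and the function names `bucket_index`, `build_buckets`, `leaf_hash`, and `build_merkle_tree` are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Hash function used throughout this sketch (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def bucket_index(row_id, n_buckets: int) -> int:
    """Hash-partition by row identifier. The same rule must be applied to
    both data tables so that equal row identifiers land in equal buckets."""
    return int.from_bytes(h(str(row_id).encode()), "big") % n_buckets


def build_buckets(table: dict, n_buckets: int) -> list:
    """Turn each row into a hash group (row_id, hash of row data) and map
    it to one of the N hash buckets. Buckets may stay empty."""
    buckets = [dict() for _ in range(n_buckets)]
    for row_id, row_data in table.items():
        buckets[bucket_index(row_id, n_buckets)][row_id] = h(row_data.encode())
    return buckets


def leaf_hash(bucket: dict) -> bytes:
    """XOR of the row hashes in the bucket; an empty bucket gives an empty hash."""
    if not bucket:
        return b""
    acc = bytes(32)
    for row_hash in bucket.values():
        acc = bytes(a ^ b for a, b in zip(acc, row_hash))
    return acc


def build_merkle_tree(table: dict, n_buckets: int):
    """Return (levels, buckets): levels[0] holds the N leaf hashes and
    levels[-1] holds the single root hash."""
    # Power-of-two N is an assumption of this sketch, not of the application.
    assert n_buckets >= 2 and n_buckets & (n_buckets - 1) == 0
    buckets = build_buckets(table, n_buckets)
    level = [leaf_hash(b) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels, buckets
```

Because the bucket index depends only on the row identifier, calling `build_merkle_tree` on the first data table and on the second data table with the same N yields two trees of identical structure.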

Before step 120, it may also include acquiring a second data table from a second database.

The present application does not specifically limit the manner of acquiring the second data table.

For example, a check mark can be inserted into the first data table of the first database. While the second database replicates data from the first database, when the check mark is detected in the second database, the data carrying the check mark in the second database is used as the data in the second data table.

Wherein, the content not described in detail in step 120 is the same as the content described in the foregoing step 110. For details, refer to the foregoing step 110, which will not be described in detail here.

Step 130: Compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.

The first data table being consistent with the second data table can be understood as follows: the number of row identifiers stored in the first data table is the same as that stored in the second data table, and the content of the row data corresponding to the same row identifier is also the same.

In this application, comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table may include:

Determine the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;

If the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, it is determined that the first data table is consistent with the second data table;

If the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, it is determined that the first data table is inconsistent with the second data table.

In the above technical solution, the hash value of each non-leaf node of the Merkle tree is obtained by performing a hash operation on its child nodes (for example, the hash value of the root node is determined according to the two intermediate nodes corresponding to the root node), and the hash value of each leaf node is determined according to the data in the hash bucket corresponding to the leaf node. Therefore, when the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, it can be considered that the first data table and the second data table are consistent. When the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, it may be considered that there is a difference between the first data table and the second data table, that is, they are inconsistent.
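As a non-limiting illustration of why comparing the two root hash values suffices, the following sketch folds a list of leaf hash values up to a root and compares two roots. SHA-256, a power-of-two number of leaves, and the names `merkle_root` and `tables_consistent` are assumptions of the sketch.

```python
import hashlib


def merkle_root(leaf_hashes: list) -> bytes:
    """Hash adjacent nodes pairwise, layer by layer, until one root remains.
    Assumes the number of leaves is a power of two."""
    level = list(leaf_hashes)
    while len(level) > 1:
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


def tables_consistent(leaves_1: list, leaves_2: list) -> bool:
    """The two tables are treated as consistent when the root hashes match."""
    return merkle_root(leaves_1) == merkle_root(leaves_2)


leaves_a = [b"\x01", b"\x02", b"\x03", b"\x04"]
leaves_b = [b"\x01", b"\x02", b"\x03", b"\x05"]  # only the last leaf differs

print(tables_consistent(leaves_a, leaves_a))  # True
print(tables_consistent(leaves_a, leaves_b))  # False
```

A change in any single leaf propagates through every ancestor of that leaf, so it always reaches the root.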

Optionally, after it is determined that the first data table is inconsistent with the second data table, the following operations may also be included:

Determine that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, where the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1≤i≤N;

Compare the i-th first hash bucket with the i-th second hash bucket, and determine the row identifiers that are inconsistent between the first data table and the second data table;

The above determination that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node may be made in sequence-number order. For example, the hash value of the first leaf node with sequence number 1 may be compared with the hash value of the second leaf node with sequence number 1; if the hash values are the same, the hash values of the leaf nodes with sequence number 2 are compared, and so on, until it is determined that the hash value of the first leaf node with sequence number i is different from the hash value of the second leaf node with sequence number i. The first leaf node with sequence number i can be understood as the i-th first leaf node, and the second leaf node with sequence number i can be understood as the i-th second leaf node. As another example, the hash value of the first leaf node with sequence number N may be compared with the hash value of the second leaf node with sequence number N; if the hash values are the same, the hash values of the leaf nodes with sequence number N-1 are compared, and so on, until it is determined that the hash value of the first leaf node with sequence number i is different from the hash value of the second leaf node with sequence number i.

The above-mentioned comparison of the i-th first hash bucket and the i-th second hash bucket to determine the row identifiers that are inconsistent between the first data table and the second data table may include:

It is determined that the row identifier included in a hash group in the i-th first hash bucket is the same as the row identifier included in a hash group in the i-th second hash bucket, but the corresponding hash values are different;

It is determined that the above row identifier is a row identifier that is inconsistent between the first data table and the second data table.
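As a non-limiting illustration, the leaf-by-leaf scan and the bucket comparison described above may be sketched as follows. Buckets are modelled as mappings from row identifier to hash value, small integers stand in for real hash values, and indices are 0-based (so index i here corresponds to sequence number i+1). Reporting row identifiers that exist in only one of the two buckets is an addition of this sketch, and all function names are hypothetical.

```python
from functools import reduce
from operator import xor


def leaf_value(bucket: dict) -> int:
    """XOR of all hash values in a bucket (0 for an empty bucket in this sketch)."""
    return reduce(xor, bucket.values(), 0)


def first_differing_leaf(leaves_1: list, leaves_2: list):
    """Compare leaves in sequence-number order; return the first differing index."""
    for i, (a, b) in enumerate(zip(leaves_1, leaves_2)):
        if a != b:
            return i
    return None


def inconsistent_row_ids(bucket_1: dict, bucket_2: dict):
    """Return (changed, missing): identifiers present in both buckets with
    different hash values, and identifiers present in only one bucket."""
    changed = sorted(r for r in bucket_1.keys() & bucket_2.keys()
                     if bucket_1[r] != bucket_2[r])
    missing = sorted(bucket_1.keys() ^ bucket_2.keys())
    return changed, missing


buckets_1 = [{1: 3, 5: 5}, {2: 7, 6: 1}]
buckets_2 = [{1: 3, 5: 5}, {2: 7, 6: 4}]  # row 6 differs in the second table
leaves_1 = [leaf_value(b) for b in buckets_1]
leaves_2 = [leaf_value(b) for b in buckets_2]

i = first_differing_leaf(leaves_1, leaves_2)
print(i)                                                # 1
print(inconsistent_row_ids(buckets_1[i], buckets_2[i]))  # ([6], [])
```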

In the above technical solution, after the inconsistent row identifiers are located, the inconsistent row data needs to be further located. In this embodiment, the process of locating inconsistent row data may be as follows: first, the inconsistent row identifiers are extracted; then, the corresponding row data, which contains the data of each row, is looked up in the source database and in the target database by these row identifiers; finally, the inconsistent row data is found by direct comparison.

It can be understood that, when the above data comparison module performs data consistency comparison, it is a top-down comparison process. After finding specific inconsistent data, the data storage module stores the information in a non-volatile storage medium for query when needed.

As an example, in conjunction with the two Merkle trees shown in FIG. 7 , the following describes the process of comparing the two Merkle trees for consistency according to the method provided in the foregoing step 130 .

The two Merkle trees shown in FIG. 7 can be obtained according to the above-mentioned steps 110 and 120. For convenience of description, these two Merkle trees are denoted as Merkle tree #1 (that is, an example of the above-mentioned first Merkle tree) and Merkle tree #2 (that is, an example of the above-mentioned second Merkle tree).

Below, the hash value of each node is described by taking Merkle tree #1 as an example; the hash value of each node of Merkle tree #2 is obtained in a similar way. The four leaf nodes #1 of Merkle tree #1 (that is, an example of the first leaf nodes) correspond to four hash buckets #1 (that is, an example of the first hash buckets). From left to right, the four leaf nodes #1 are denoted as the first leaf node #1, the second leaf node #1, the third leaf node #1, and the fourth leaf node #1, and the four hash buckets #1 are denoted as the first hash bucket #1, the second hash bucket #1, the third hash bucket #1, and the fourth hash bucket #1. Each hash bucket #1 includes 2 hash groups, and each hash group includes a row ID and a hash value.

The hash value of the first leaf node #1 is obtained by XORing the hash values included in the two hash groups in the first hash bucket #1, that is, the hash value of the first leaf node #1 is N0 = XOR(1, 5), where XOR(1, 5) means performing an XOR operation on the hash value corresponding to row ID 1 and the hash value corresponding to row ID 5, that is, XOR(1, 5) is equal to XOR(011, 101). The hash value of an intermediate node #1 is obtained by hashing the hash values of its two child nodes. For example, N4 = H(N0, N1) represents the hash value of an intermediate node #1 of Merkle tree #1, and H(N0, N1) represents the result of hashing the hash values (that is, N0 and N1) of the two leaf nodes #1 of this intermediate node #1. The hash value of root node #1 is obtained by hashing the hash values of its two child nodes. For example, H(N4, N5) represents the hash value of root node #1 of Merkle tree #1.

Referring to FIG. 7, the process of performing a consistency check on Merkle tree #1 and Merkle tree #2 may be as follows. First, compare whether the hash values of the root nodes are the same. Since the fourth hash bucket #1 of Merkle tree #1 is different from the fourth hash bucket #2 of Merkle tree #2 (that is, the hash values corresponding to the hash group with row ID "6" are different), the hash value of root node #1 of Merkle tree #1 (that is, H(N4, N5) in Merkle tree #1) is different from the hash value of root node #2 of Merkle tree #2 (that is, H(N4, N5) in Merkle tree #2). Next, a top-down search is performed to determine the leaf nodes in which Merkle tree #1 is inconsistent with Merkle tree #2. As can be seen from FIG. 7, the fourth leaf node #1 is different from the fourth leaf node #2. Then, according to the hash buckets corresponding to the fourth leaf node #1 and the fourth leaf node #2, it is determined that the inconsistent row identifier is "6".
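As a non-limiting illustration, the walk-through above may be reproduced with the following sketch. Only the hash values 011 and 101 for row IDs 1 and 5, and the fact that row ID 6 in the fourth bucket differs, are taken from the description; every other row ID and hash value, the SHA-256-based function `H`, and the function names are hypothetical placeholders.

```python
import hashlib


def H(left, right) -> str:
    """Hash of two child values (SHA-256 over their text form is an assumption)."""
    return hashlib.sha256(f"{left}|{right}".encode()).hexdigest()


def build_levels(buckets: list) -> list:
    """levels[0] = leaf values (XOR per bucket), levels[-1][0] = root hash."""
    leaves = []
    for bucket in buckets:
        acc = 0
        for value in bucket.values():
            acc ^= value
        leaves.append(acc)
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_leaves(levels_1: list, levels_2: list) -> list:
    """Top-down search: descend only into subtrees whose hashes differ."""
    top = len(levels_1) - 1
    if levels_1[top][0] == levels_2[top][0]:
        return []
    frontier = [0]
    for depth in range(top - 1, -1, -1):
        frontier = [child
                    for node in frontier
                    for child in (2 * node, 2 * node + 1)
                    if levels_1[depth][child] != levels_2[depth][child]]
    return frontier


# Four buckets of two hash groups each, as in the description of FIG. 7.
tree_1 = [{1: 0b011, 5: 0b101}, {2: 0b010, 7: 0b111},
          {3: 0b001, 8: 0b100}, {4: 0b110, 6: 0b010}]
tree_2 = [{1: 0b011, 5: 0b101}, {2: 0b010, 7: 0b111},
          {3: 0b001, 8: 0b100}, {4: 0b110, 6: 0b011}]

levels_1, levels_2 = build_levels(tree_1), build_levels(tree_2)
print(bin(levels_1[0][0]))                 # 0b110, i.e. N0 = XOR(011, 101)
leaf = differing_leaves(levels_1, levels_2)[0]
print(leaf)                                # 3, the fourth leaf (0-based index)
print([r for r in tree_1[leaf] if tree_1[leaf][r] != tree_2[leaf][r]])  # [6]
```

With four leaves, the top-down search inspects two intermediate nodes and then only the two leaves under the differing intermediate node, instead of all four leaves.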

It should be understood that FIG. 7 is only for illustration and does not constitute any limitation to the present application. For example, each of hash bucket #1 and hash bucket #2 shown in FIG. 7 may include a different number of hash groups. For example, the hash values included in hash bucket #1 and hash bucket #2 shown in FIG. 7 may also be hash values of a larger magnitude.

In this application, the types of the first database and the second database involved in the above steps 110 to 130 are not specifically limited.

Optionally, the first database and the second database may be heterogeneous databases.

Optionally, the first database and the second database may be homogeneous databases.

Optionally, the first database may be a relational database or a non-relational database, and the second database may be a relational database or a non-relational database.

For example, the first database may be a relational database, and the second database may be a non-relational database. For example, the first database may be a relational database, and the second database may be a relational database.

Relational databases usually store data in tabular form, so data can be extracted directly from a relational database to construct the first data table. Non-relational databases usually store data in non-tabular form (for example, documents, key-value pairs, or graph structures). Therefore, before data is extracted from a non-relational database to construct the first data table, the data to be verified in the non-relational database also needs to be converted into tabular form, where each row of the table may include a row ID and one or more pieces of row data.
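As a non-limiting illustration, the conversion of non-tabular data into rows may be sketched as follows for document-style records. The field name `_id`, the use of canonical JSON as the row data, and the function name `documents_to_rows` are assumptions of the sketch and are not specified by this application.

```python
import json


def documents_to_rows(documents: list, id_field: str = "_id") -> dict:
    """Map each document to a table row: row ID -> serialized row data.
    Keys are sorted so that equal documents always serialize identically,
    regardless of the key order used by the source database."""
    rows = {}
    for doc in documents:
        row_id = doc[id_field]
        row_data = {k: v for k, v in doc.items() if k != id_field}
        rows[row_id] = json.dumps(row_data, sort_keys=True, separators=(",", ":"))
    return rows


docs = [
    {"_id": 1, "name": "a", "age": 3},
    {"_id": 2, "age": 4, "name": "b"},  # same fields, different key order
]
print(documents_to_rows(docs)[1])  # {"age":3,"name":"a"}
```

A deterministic serialization matters here because the row data is subsequently hashed; two logically equal rows must produce the same bytes in both databases.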

In most scenarios, the data in the production database (that is, an example of the above-mentioned first database) changes dynamically. If data that has already been verified changes again during the process of data verification, it needs to be verified again. If the update is applied to the original Merkle tree, each leaf update causes the hash values of the layers above it to be recalculated; when the Merkle tree has many layers, the computational overhead increases sharply. Another factor to consider is that when the incremental data grows rapidly, the original Merkle tree needs to be expanded and rebuilt to maintain the overall verification performance, which affects the data that has already been verified.

The data verification method provided by this application can also verify the full data and the incremental data respectively. Specifically, a Merkle tree with a higher number of layers is used for verification on the full amount of data, and a Merkle tree with a lower number of layers is used for verification on the incremental data, thereby effectively reducing computational overhead.

In this application, the method for acquiring incremental data to be verified from the data table is not specifically limited. For example, an existing method for acquiring incremental data to be verified in a data table may be used. For example, other methods for acquiring incremental data to be verified may also be used.

As an example, data table #1 (that is, an example of the first data table described above) includes 10 rows of data, and data table #2 (that is, an example of the second data table described above) includes 10 rows of data and is obtained by replicating data table #1. At time #1, the methods of steps 110 to 130 above can be used to generate Merkle tree #1 (that is, an example of the above-mentioned first Merkle tree) according to data table #1 and Merkle tree #2 (that is, an example of the above-mentioned second Merkle tree) according to data table #2, and to determine, by comparing Merkle tree #1 and Merkle tree #2, whether data table #1 and data table #2 are consistent. At a time after time #1, the data in rows 5 to 10 of data table #1 is updated. The updated data table #1 can be denoted as data table #3 (that is, an example of the above-mentioned first data table), and data table #4 (that is, an example of the above-mentioned second data table) is obtained by replicating data table #3. In this case, the methods of steps 110 to 130 above may be used to perform the consistency check only on the incremental data. Specifically, Merkle tree #3 (that is, an example of the above-mentioned first Merkle tree) is generated according to data table #3, Merkle tree #4 (that is, an example of the above-mentioned second Merkle tree) is generated according to data table #4, and whether data table #3 and data table #4 are consistent is determined by comparing Merkle tree #3 and Merkle tree #4.

The data verification method provided by the present application can better meet the requirements for accurate and fast verification of massive data in different scenarios (for example, online data or offline data). In the above technical solution, the hash value of each first leaf node of the first Merkle tree corresponding to the first data table is determined according to its corresponding first hash bucket, and the hash value of each second leaf node of the second Merkle tree corresponding to the second data table is determined according to its corresponding second hash bucket. Because the hashing rule for obtaining the N second hash buckets by hash partitioning the second data table according to row identifiers is the same as the hashing rule for obtaining the N first hash buckets by hash partitioning the first data table according to row identifiers, it can be ensured that the data corresponding to the same row identifier in the second data table and in the first data table is mapped to the second hash bucket and the first hash bucket with the same sequence number, respectively, so that the generated first Merkle tree has the same structure as the second Merkle tree. Therefore, whether the first data table and the second data table are consistent can be determined directly by comparing the hash values of the root nodes of the two Merkle trees. In addition, because the hash values of the leaf nodes of the two Merkle trees are obtained by performing an XOR operation on the data in the corresponding hash buckets, sorting of row data can be avoided when the consistency check is performed on the first data table and the second data table.
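As a non-limiting illustration of why the XOR operation makes sorting unnecessary, the following sketch compares an XOR-based leaf hash with a leaf hash computed over concatenated row hashes. SHA-256 and the function names `row_hash`, `xor_leaf`, and `concat_leaf` are assumptions of the sketch; the concatenation variant is shown only as a contrast and is not part of the described method.

```python
import hashlib
from functools import reduce


def row_hash(row_data: str) -> int:
    """Hash of one row's data as an integer (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(row_data.encode()).digest(), "big")


def xor_leaf(row_hashes: list) -> int:
    """Leaf hash as XOR of row hashes: commutative, so row order is irrelevant."""
    return reduce(lambda a, b: a ^ b, row_hashes, 0)


def concat_leaf(row_hashes: list) -> str:
    """Contrast case: hashing the concatenation depends on row order."""
    joined = "".join(f"{x:064x}" for x in row_hashes)
    return hashlib.sha256(joined.encode()).hexdigest()


hashes = [row_hash(d) for d in ("row-a", "row-b", "row-c")]
reordered = hashes[::-1]  # same rows, read back in a different order

print(xor_leaf(hashes) == xor_leaf(reordered))        # True
print(concat_leaf(hashes) == concat_leaf(reordered))  # False
```

Two databases may return the rows of a bucket in different orders; with XOR the leaf hash is unaffected, whereas an order-dependent combination would require sorting each bucket first.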

When it is determined that the hash values of the root nodes of the two Merkle trees are inconsistent, the data verification method provided by the present application can further determine the inconsistent row identifiers and the corresponding row data by comparing the hash values of the leaf nodes of the two Merkle trees. In order to meet the needs of data verification in different scenarios, the data verification method provided by the present application can also adaptively adjust the tree heights of the first Merkle tree and the second Merkle tree according to the size of the data set to be verified.

Next, the verification process of the data verification method 200 provided by the present application will be described in detail with reference to FIG. 8 .

As shown in FIG. 8 , the method 200 includes steps 210 to 290 , and the steps 210 to 290 are described below. The execution subject of steps 210 to 290 may be the data verification apparatus 130 shown in FIG. 2 .

Step 210, start.

The above-mentioned step 210 indicates that the data consistency check is started.

In step 220, check mark bits are added to the data to be replicated in database #1 (ie, an example of the first database in the above method 100).

The method of inserting a check mark into the data to be copied may be the same as the existing method, and details are not described herein again.

Step 230: Database #2 (ie, an example of the second database in the above-mentioned method 100) replicates the above-mentioned data to be replicated.

Step 240, database #1 detects the flag bit and acquires data table #1.

The method of acquiring the data table #1 after detecting the flag bit from the database #1 can be the same as the existing method, and details are not described here.

Step 250, database #2 detects the flag bit, and acquires data table #2.

Step 260, according to the data table #1, generate a Merkle tree #1 (that is, an example of the first Merkle tree in the above method 100).

Step 270, according to data table #2, generate Merkle tree #2 (ie, an example of the second Merkle tree in the above method 100).

The method for determining the Merkle tree in the above steps 260 and 270 is the same as the method for determining the Merkle tree in the method 100. For details, refer to the above step 110, which will not be described in detail here.

Step 280: Determine whether the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2.

The above determines whether the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2, including:

When it is determined that the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2, step 281 is performed;

If it is determined that the hash value of the root node of Merkle tree #1 is not the same as the hash value of the root node of Merkle tree #2, step 282 and step 283 are performed.

Step 281, it is determined that the data table #1 and the data table #2 are consistent.

Step 282: Compare the hash value of the leaf node of Merkle tree #1 with the hash value of the leaf node of Merkle tree #2, and determine inconsistent row identifiers.

For details, refer to the method in step 130 above, which will not be described in detail here.

In this application, the execution sequence of step 281 and step 282 is not specifically limited.

For example, after step 280 is performed, step 281 may be performed first and then step 282 may be performed. For example, after step 280 is performed, step 282 may be performed first and then step 281 may be performed.

Step 283: Determine inconsistent row data from database #1 and database #2 according to the above determined inconsistent row identifiers.

Optionally, after step 283, the determined inconsistent row identifiers and corresponding row data may also be stored in a storage module of the data verification apparatus, for example, the storage module 136 shown in FIG. 2 .

Step 290, end.

The above step 290 indicates ending the data consistency check.

It should be understood that the above-mentioned FIG. 8 is only for illustration and does not impose any limitation on the data verification process provided by the present application. For example, in some implementations, steps 282 and 283 may not be performed after it is determined that data table #1 and data table #2 are inconsistent.
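As a non-limiting illustration, steps 260 to 283 of the method 200 may be sketched as a single function. Tables are modelled as mappings from row identifier to row data, SHA-256 is assumed as the hash function, N is assumed to be a power of two, and the function name `verify` and its helpers are hypothetical. Acquiring the data tables (steps 220 to 250) is outside the scope of the sketch.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _buckets(table: dict, n: int) -> list:
    """Hash-partition rows by row identifier into n buckets of hash groups."""
    buckets = [dict() for _ in range(n)]
    for row_id, row_data in table.items():
        idx = int.from_bytes(_h(str(row_id).encode()), "big") % n
        buckets[idx][row_id] = _h(row_data.encode())
    return buckets


def _leaf(bucket: dict) -> bytes:
    """XOR of row hashes; an empty bucket yields an empty leaf hash."""
    acc = bytes(32) if bucket else b""
    for value in bucket.values():
        acc = bytes(a ^ b for a, b in zip(acc, value))
    return acc


def _root(leaves: list) -> bytes:
    level = leaves
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def verify(table_1: dict, table_2: dict, n_buckets: int = 4):
    """Return (consistent, details), where details maps each inconsistent
    row identifier to its (table_1 row data, table_2 row data) pair."""
    b1, b2 = _buckets(table_1, n_buckets), _buckets(table_2, n_buckets)  # steps 260, 270
    l1, l2 = [_leaf(b) for b in b1], [_leaf(b) for b in b2]
    if _root(l1) == _root(l2):                      # step 280
        return True, {}                             # step 281
    details = {}
    for i in range(n_buckets):                      # step 282
        if l1[i] == l2[i]:
            continue
        for row_id in b1[i].keys() | b2[i].keys():
            if b1[i].get(row_id) != b2[i].get(row_id):
                # step 283: fetch the row data from both sides
                details[row_id] = (table_1.get(row_id), table_2.get(row_id))
    return False, details
```

For instance, with a ten-row table replicated into a second table in which one row is altered and one row is dropped, `verify` is expected to return `False` together with exactly those two row identifiers; a row absent from one side is reported with `None` as its row data, which is a convention of this sketch.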

FIG. 9 includes four Merkle trees, namely Merkle tree #3 (that is, an example of the first Merkle tree above), Merkle tree #4 (that is, another example of the first Merkle tree above), Merkle tree #5 (that is, an example of the second Merkle tree above), and Merkle tree #6 (that is, another example of the second Merkle tree above). Merkle tree #3 and Merkle tree #5 have a height of 3, and Merkle tree #4 and Merkle tree #6 have a height of 2. For specific descriptions of Merkle tree #3, Merkle tree #4, Merkle tree #5, and Merkle tree #6, refer to the content described for FIG. 7 above, which is not repeated here.

Merkle tree #3 can be understood as a Merkle tree generated at time #1 from the full amount of data in data table #1 (that is, an example of the first data table above). Merkle tree #4 can be understood as a Merkle tree generated at time #2 from the incremental data in data table #1, where time #2 is a time after time #1, and the incremental data in data table #1 at time #2 includes the row data corresponding to the following row identifiers in data table #1: "2", "6", "7", and "8". Merkle tree #5 can be understood as a Merkle tree generated at time #1 from the full amount of data in data table #2 (that is, an example of the second data table above). Merkle tree #6 can be understood as a Merkle tree generated at time #2 from the incremental data in data table #2, where the incremental data in data table #2 at time #2 includes the row data corresponding to the following row identifiers in data table #2: "2", "6", "7", and "8". Data table #2 is a data table obtained by migrating or synchronizing data table #1.

For the method of generating the above Merkle tree #3, Merkle tree #4, Merkle tree #5, and Merkle tree #6, reference may be made to method 100, which will not be described in detail here.

Checking the consistency of data table #1 and data table #2 may include:

By comparing Merkle tree #3 and Merkle tree #5, it can be determined whether the full amount of data in data table #1 and data table #2 is consistent;

It can be determined whether the incremental data in data table #1 and data table #2 are consistent by comparing Merkle tree #4 and Merkle tree #6.

Optionally, after determining that the full amount of data in data table #1 and data table #2 is inconsistent, the inconsistent row identifiers can also be determined by comparing the leaf nodes of Merkle tree #3 and the leaf nodes of Merkle tree #5, and the row data corresponding to the inconsistent row identifiers is determined from data table #1 and data table #2 according to the inconsistent row identifiers.

Optionally, after determining that the incremental data in data table #1 and data table #2 are inconsistent, you can also determine the inconsistent data by comparing the leaf nodes of Merkle tree #4 and the leaf nodes of Merkle tree #6. row identifiers, and row data corresponding to the inconsistent row identifiers is determined from data table #1 and data table #2 according to the inconsistent row identifiers.

In the embodiment of the present application, an independent small Merkle tree is used to perform data verification on the incremental data, which can further save computational overhead and improve the efficiency of data consistency verification.
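As a non-limiting illustration, verifying the full data and the incremental data with separate trees may be sketched as follows. The sketch assumes that tree height counts layers, so that a tree with 4 leaves has a height of 3 and a tree with 2 leaves has a height of 2; it also assumes SHA-256, treats an empty bucket as the value 0 for brevity, and uses hypothetical row contents. The function name `merkle_root` is likewise hypothetical.

```python
import hashlib


def merkle_root(rows: dict, n_leaves: int) -> bytes:
    """Root hash of a tree with n_leaves XOR-based leaves (power of two assumed)."""
    leaves = [0] * n_leaves
    for row_id, row_data in rows.items():
        idx = int.from_bytes(hashlib.sha256(str(row_id).encode()).digest(), "big") % n_leaves
        leaves[idx] ^= int.from_bytes(hashlib.sha256(row_data.encode()).digest(), "big")
    level = [x.to_bytes(32, "big") for x in leaves]
    while len(level) > 1:
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


# Time #1: full data, verified with the taller tree (4 leaves, height 3).
table_1 = {i: f"v{i}" for i in range(1, 9)}
table_2 = dict(table_1)
full_consistent = merkle_root(table_1, 4) == merkle_root(table_2, 4)

# Time #2: rows 2, 6, 7 and 8 change; row 7 is replicated incorrectly.
changed_ids = [2, 6, 7, 8]
for r in changed_ids:
    table_1[r] += "-new"
    table_2[r] += "-new"
table_2[7] = "corrupted"

# Only the incremental rows go into the smaller tree (2 leaves, height 2).
inc_1 = {r: table_1[r] for r in changed_ids}
inc_2 = {r: table_2[r] for r in changed_ids}
incremental_consistent = merkle_root(inc_1, 2) == merkle_root(inc_2, 2)

print(full_consistent)         # True
print(incremental_consistent)  # False
```

The incremental check hashes four rows and builds three nodes, while the tree built at time #1 is left untouched.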

The data verification method provided by the present application and the system architecture suitable for the method are described in detail above with reference to FIG. 1 to FIG. 9 . Below, the data verification apparatus, the data verification device, and the data verification system provided by the present application will be described in detail with reference to FIG. 10 to FIG. 12 . It should be understood that the descriptions of the method embodiments correspond to the descriptions of the apparatus embodiments. Therefore, for the parts not described in detail, reference may be made to the foregoing method embodiments.

In this embodiment of the present application, the data verification device should include a processing unit and a determination unit. The data verification device may be the data verification device 130 above.

Optionally, in some implementation manners, the data verification apparatus may further include a transceiver unit.

In the following, with reference to FIG. 10 , the data verification apparatus is described by taking as an example the case in which it includes a processing unit and a determination unit.

As shown in FIG. 10 , the apparatus 1000 includes: a processing unit 1001 and a determination unit 1002 .

The processing unit 1001 is configured to process the first data table in the first database to generate a first Merkle tree, where each row of the first data table includes a row identifier and row data, the first Merkle tree includes N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifiers, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;

The processing unit 1001 is further configured to process a second data table in the second database to generate a second Merkle tree, where the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to the row identifiers, the hash rule for obtaining the N second hash buckets by hash partitioning the second data table according to the row identifiers is the same as the hash rule for obtaining the N first hash buckets by hash partitioning the first data table according to the row identifiers, and any two second hash buckets are different;

The determining unit 1002 is configured to compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.

Optionally, the first data table includes M rows, where M is a positive integer greater than or equal to 1,

The processing unit 1001 is also used for:

Hash the M rows to obtain M first hash groups, the M first hash groups are in one-to-one correspondence with the M rows, and each of the first hash groups includes one row in the M rows The identifier and the hash value of the row data corresponding to the one row identifier, the row identifiers included in each of the first hash groups are different;

mapping the M first hash groups to the N first hash buckets;

The determining unit 1002 is also used for:

The processing unit 1001 is also used for:

Optionally, the hash value of each first leaf node is a hash value obtained by performing an XOR operation on the hash values included in the first hash group included in the corresponding first hash bucket.

Optionally, the determining unit 1002 is further configured to:

The processing unit 1001 is also used for:

Optionally, the first data table includes the full amount of data in at least one data table in the first database.

Optionally, the first data table includes incremental data in at least one data table in the first database.

Optionally, the height of the first Merkle tree is associated with the first data table.
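How the height is associated with the data table is not detailed in this passage. Purely as an assumed policy for illustration, the number of hash buckets N could be chosen from the row count of the first data table, and the height then follows from N for a binary tree. The names choose_bucket_count and tree_height and the default of 1000 rows per bucket are hypothetical.

```python
import math

def choose_bucket_count(row_count: int, rows_per_bucket: int = 1000) -> int:
    """Smallest power of two N >= 2 such that each bucket holds roughly
    rows_per_bucket rows or fewer (an assumed sizing policy)."""
    needed = max(2, math.ceil(row_count / rows_per_bucket))
    return 1 << (needed - 1).bit_length()

def tree_height(n_leaves: int) -> int:
    """Number of levels of a binary Merkle tree over n_leaves leaves,
    counting both the leaf level and the root level."""
    return (n_leaves - 1).bit_length() + 1

n = choose_bucket_count(1_000_000)   # 1024 buckets under the assumed policy
height = tree_height(n)              # 11 levels: 1024, 512, ..., 2, 1 nodes
```

Under this assumption a larger table yields more leaf nodes and therefore a taller tree, while the number of rows per bucket stays roughly constant.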

Optionally, the first database and the second database are heterogeneous databases or homogeneous databases.

Optionally, the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.

In the following, with reference to FIG. 11 , the data verification device including a transceiver, a processor and a memory is used as an example for introduction.

FIG. 11 is a schematic structural diagram of a data verification device 1000 provided by the present application. As shown in FIG. 11, the device 1000 includes a transceiver 1010, a processor 1020 and a memory 1030. The transceiver 1010, the processor 1020 and the memory 1030 communicate with each other through an internal connection path to transmit control and/or data signals. The memory 1030 is used to store a computer program, and the processor 1020 is used to call and run the computer program from the memory 1030 to control the transceiver 1010 to send and receive signals.

Specifically, the transceiver 1010 can be used to obtain the above-mentioned first data table and second data table, which will not be repeated here.

Specifically, the functions of the processor 1020 correspond to the specific functions of the processing unit 1001 and the determination unit 1002 shown in FIG. 10 , and details are not repeated here.

In this embodiment of the present application, the data verification device should include a processor. Wherein, the data verification device may be any one of the terminal devices described above.

Optionally, in some implementations, the data verification device may further include a transceiver.

Optionally, in some implementations, the data verification device may further include a memory.

FIG. 12 is a schematic structural diagram of a system 1200 provided by the present application. As shown in FIG. 12 , the system 1200 includes: the data verification apparatus 1000 or the data verification device 1100 mentioned above. Optionally, the system 1200 may further include the above-mentioned first database and the above-mentioned second database.

Embodiments of the present application provide a computer program product which, when run on the data verification apparatus 1310, enables the data verification apparatus 1310 to execute the method 100 and/or the method 200 in the above method embodiments.

Those of ordinary skill in the art can realize that the method steps and units described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and components of the various embodiments have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Persons of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, for the specific working process of the above-described systems, devices and units, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.

In the several embodiments provided in this application, the disclosed systems, devices and methods may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.

The unit described as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

The above descriptions are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any equivalent modification or replacement that a person skilled in the art can readily conceive within the technical scope disclosed in the present application should be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

The above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, they can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired or wireless means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., digital video discs (DVDs)), semiconductor media (e.g., solid state drives), and the like.

Those of ordinary skill in the art can understand that all or part of the steps of implementing the above embodiments can be completed by hardware, or can be completed by instructing relevant hardware through a program, and the program can be stored in a computer-readable storage medium. The storage medium can be read-only memory, magnetic disk or optical disk, etc.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In addition, the term "and/or" in this application merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean three cases: A exists alone, A and B exist at the same time, and B exists alone. In addition, the character "/" in this document generally indicates an "or" relationship between the objects before and after it. The term "at least one" in this application can mean "one" or "two or more". For example, at least one of A, B, and C can mean seven cases: A exists alone, B exists alone, C exists alone, A and B exist simultaneously, A and C exist simultaneously, C and B exist simultaneously, and A, B and C exist simultaneously.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed in the present application should be covered within the protection scope of this application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims

A data verification method, characterized in that the method comprises:

processing a first data table in a first database to generate a first Merkle tree, wherein each row of the first data table includes a row identifier and row data, the first Merkle tree includes N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;

processing a second data table in a second database to generate a second Merkle tree, wherein the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to row identifiers, the hash rule used to obtain the N second hash buckets by hash partitioning the second data table according to row identifiers is the same as the hash rule used to obtain the N first hash buckets by hash partitioning the first data table according to row identifiers, and any two second hash buckets are different;

comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
The method according to claim 1, wherein the first data table includes M rows, where M is a positive integer greater than or equal to 1, and the processing of the first data table in the first database to generate the first Merkle tree comprises:

performing hash processing on the M rows to obtain M first hash groups, wherein the M first hash groups are in one-to-one correspondence with the M rows, each first hash group includes the row identifier of one of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers included in different first hash groups are different;

mapping the M first hash groups to the N first hash buckets;

Determine the hash values of the N first leaf nodes according to the N first hash buckets;

The first Merkle tree is generated according to the hash values of the N first leaf nodes.
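By way of a hedged example only, generating a Merkle tree from the N leaf hash values might look like the following Python sketch. The binary tree shape, the use of SHA-256 over the concatenated children, the rule of repeating the last node on an odd-length level, and the names parent_hash and build_merkle_tree are assumptions of this example, not requirements of the claims.

```python
import hashlib

def parent_hash(left: bytes, right: bytes) -> bytes:
    """Hash of an internal node computed from its two children (SHA-256 assumed)."""
    return hashlib.sha256(left + right).digest()

def build_merkle_tree(leaf_hashes: list) -> list:
    """Return the tree as a list of levels: levels[0] holds the leaf hashes
    and levels[-1] holds the single root hash."""
    if not leaf_hashes:
        raise ValueError("at least one leaf is required")
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        if len(current) % 2 == 1:
            # Assumed padding rule: repeat the last node on an odd-length level.
            current = current + [current[-1]]
        levels.append([parent_hash(current[i], current[i + 1])
                       for i in range(0, len(current), 2)])
    return levels

# Hypothetical 32-byte leaf hashes for N = 4 hash buckets.
leaves = [bytes([i]) * 32 for i in range(4)]
tree = build_merkle_tree(leaves)
root = tree[-1][0]
```

Whatever construction is chosen, the same construction has to be applied to both data tables so that equal leaf hash values always yield equal root hash values.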
The method of claim 2, wherein:

The hash value of each of the first leaf nodes is obtained by performing an XOR operation on the hash values included in the first hash groups that are included in the corresponding first hash bucket.
The method according to any one of claims 1-3, wherein the comparing of the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table comprises:

determining the hash value of the first Merkle tree root node and the hash value of the second Merkle tree root node;

If the hash value of the first Merkle tree root node is the same as the hash value of the second Merkle tree root node, it is determined that the first data table is consistent with the second data table;

If the hash value of the first Merkle tree root node is different from the hash value of the second Merkle tree root node, it is determined that the first data table is inconsistent with the second data table.
The method according to claim 4, wherein after determining that the first data table is inconsistent with the second data table, the method further comprises:

It is determined that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, wherein the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1≤i≤N;

Compare the i-th first hash bucket and the i-th second hash bucket, and determine the row identifiers that are inconsistent between the first data table and the second data table;

The row data corresponding to the inconsistent row IDs are queried from the first database and the second database respectively according to the inconsistent row IDs.
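The root comparison and the subsequent localization of inconsistent rows could, for illustration, be sketched in Python as below. This is a minimal sketch under stated assumptions: each tree is held as a list of levels (leaves first, root last) in which node j on one level has children 2j and 2j+1 on the level below, each hash bucket is a dictionary from row identifier to row-data hash, and the demo buckets are filled by hand rather than by a hash rule. The names four_leaf_tree, xor_leaf, differing_leaves and inconsistent_row_ids are hypothetical.

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def xor_leaf(bucket: dict) -> bytes:
    """Leaf hash: XOR of all row-data hashes in the bucket (32 zero bytes if empty)."""
    acc = bytes(32)
    for row_hash in bucket.values():
        acc = bytes(x ^ y for x, y in zip(acc, row_hash))
    return acc

def four_leaf_tree(leaves: list) -> list:
    """Hand-built tree for the demo: [leaf level, middle level, root level]."""
    mid = [sha(leaves[0] + leaves[1]), sha(leaves[2] + leaves[3])]
    return [list(leaves), mid, [sha(mid[0] + mid[1])]]

def differing_leaves(levels_a: list, levels_b: list) -> list:
    """Indices of leaves whose hashes differ, found by descending from the root
    and following only those children whose hashes differ."""
    if levels_a[-1][0] == levels_b[-1][0]:
        return []  # equal roots: the tables are judged consistent
    suspects = [0]
    for depth in range(len(levels_a) - 2, -1, -1):
        width = len(levels_a[depth])
        suspects = [child
                    for parent in suspects
                    for child in (2 * parent, 2 * parent + 1)
                    if child < width and levels_a[depth][child] != levels_b[depth][child]]
    return suspects

def inconsistent_row_ids(bucket_a: dict, bucket_b: dict) -> list:
    """Row identifiers present on one side only, or whose row-data hashes differ."""
    return sorted(rid for rid in set(bucket_a) | set(bucket_b)
                  if bucket_a.get(rid) != bucket_b.get(rid))

# Hypothetical buckets; row "2" was changed during synchronization.
src = [{"1": sha(b"alice,30")}, {"2": sha(b"bob,25")}, {"3": sha(b"carol,41")}, {}]
dst = [{"1": sha(b"alice,30")}, {"2": sha(b"bob,26")}, {"3": sha(b"carol,41")}, {}]
tree_src = four_leaf_tree([xor_leaf(b) for b in src])
tree_dst = four_leaf_tree([xor_leaf(b) for b in dst])
bad_leaves = differing_leaves(tree_src, tree_dst)            # [1]
bad_rows = [rid for i in bad_leaves
            for rid in inconsistent_row_ids(src[i], dst[i])]  # ["2"]
```

Note that the sketch uses zero-based leaf indices, whereas the claim counts i from 1. The resulting row identifiers would then be used to query the corresponding row data from the two databases, so that only the suspect rows, rather than the full tables, need to be fetched and compared.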
The method according to any one of claims 1-5, wherein the first data table includes the full amount of data in at least one data table in the first database.
The method according to any one of claims 1-5, wherein the first data table includes incremental data in at least one data table in the first database.
The method according to any one of claims 1-7, wherein the height of the first Merkle tree is associated with the first data table.
The method according to any one of claims 1-8, wherein the first database and the second database are heterogeneous databases or homogeneous databases.
The method according to any one of claims 1-9, wherein the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.
A data verification device, characterized in that the device comprises:

a processing unit, configured to process the first data table in the first database to generate a first Merkle tree, wherein each row of the first data table includes a row identifier and row data, the first Merkle tree includes N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to row identifiers, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;

The processing unit is further configured to process the second data table in the second database to generate a second Merkle tree, wherein the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to row identifiers, the hash rule used to obtain the N second hash buckets by hash partitioning the second data table according to row identifiers is the same as the hash rule used to obtain the N first hash buckets by hash partitioning the first data table according to row identifiers, and any two second hash buckets are different;

A determining unit, configured to compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
The device according to claim 11, wherein the first data table comprises M rows, where M is a positive integer greater than or equal to 1,

The processing unit is also used to:

Hash processing is performed on the M rows to obtain M first hash groups, wherein the M first hash groups are in one-to-one correspondence with the M rows, each first hash group includes the row identifier of one of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers included in different first hash groups are different;

mapping the M first hash groups to the N first hash buckets;

The determining unit is also used for:

Determine the hash values of the N first leaf nodes according to the N first hash buckets;

The processing unit is also used to:

The first Merkle tree is generated according to the hash values of the N first leaf nodes.
The apparatus of claim 12, wherein:

The hash value of each of the first leaf nodes is obtained by performing an XOR operation on the hash values included in the first hash groups that are included in the corresponding first hash bucket.
The device according to any one of claims 11-13, wherein the determining unit is further configured to:

determining the hash value of the first Merkle tree root node and the hash value of the second Merkle tree root node;

If the hash value of the first Merkle tree root node is the same as the hash value of the second Merkle tree root node, it is determined that the first data table is consistent with the second data table;

If the hash value of the first Merkle tree root node is different from the hash value of the second Merkle tree root node, it is determined that the first data table is inconsistent with the second data table.
The device of claim 14, wherein:

The determining unit is also used for:

It is determined that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, wherein the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1≤i≤N;

Compare the i-th first hash bucket and the i-th second hash bucket, and determine the row identifiers that are inconsistent between the first data table and the second data table;

The processing unit is also used to:

The row data corresponding to the inconsistent row IDs are queried from the first database and the second database respectively according to the inconsistent row IDs.
The apparatus according to any one of claims 11 to 15, wherein the first data table includes the full amount of data in at least one data table in the first database.
The apparatus according to any one of claims 11-16, wherein the first data table includes incremental data in at least one data table in the first database.
The apparatus according to any one of claims 11-17, wherein the height of the first Merkle tree is associated with the first data table.
The apparatus according to any one of claims 11-18, wherein the first database and the second database are heterogeneous databases or homogeneous databases.
The apparatus according to any one of claims 11-19, wherein the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.
A data verification device, characterized in that it comprises at least one processor and a communication interface, wherein the at least one processor is used to execute a computer program or instructions, so that the data verification device performs the method according to any one of claims 1 to 10.
The data verification device of claim 21, wherein the device further comprises at least one memory coupled to the at least one processor, and the computer program or instructions are stored in the at least one memory.
A computer-readable storage medium for storing computer instructions, wherein when the computer instructions are executed, the method according to any one of claims 1 to 10 is implemented.
A system, characterized by comprising the data verification device as claimed in claim 21 or 22.