WO2022063223A1 - Data verification method, apparatus, and system - Google Patents

Data verification method, apparatus, and system

Info

Publication number
WO2022063223A1
Authority
WO
WIPO (PCT)
Prior art keywords
hash
data table
data
database
row
Prior art date
Application number
PCT/CN2021/120282
Other languages
French (fr)
Chinese (zh)
Inventor
黄凯耀
郑云洲
孟小珍
李龙
赵俊
李志学
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022063223A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials

Definitions

  • the present application relates to the field of storage, and more particularly, to a data verification method, apparatus and system.
  • the present application provides a data verification method, device and system, which can be applied in the scenarios of data synchronization or data migration, and the method can better meet the requirements for accurate and fast verification of massive data.
  • a first aspect provides a data verification method, characterized in that the method includes:
  • the first data table in the first database is processed to generate a first Merkle tree, each row of the first data table includes a row identifier and row data, and the first Merkle tree includes N first leaf nodes,
  • the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
  • the second data table in the second database is processed to generate a second Merkle tree, the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, and the hash value of each second leaf node is determined according to the corresponding second hash bucket,
  • the N second hash buckets are obtained by hash partitioning the second data table according to the row identifier, the hash rule used to obtain the N second hash buckets is the same as the hash rule used to obtain the N first hash buckets by hash partitioning the first data table according to the row identifier, and any two second hash buckets are different;
  • the first Merkle tree is compared with the second Merkle tree to determine whether the first data table is consistent with the second data table.
  • the hash value of each first leaf node of the first Merkle tree corresponding to the first data table is determined according to the corresponding first hash bucket, and the hash value of each second leaf node of the second Merkle tree corresponding to the second data table is determined according to its corresponding second hash bucket.
  • the hashing rule for obtaining the N second hash buckets by hash partitioning the second data table according to row identifiers is the same as the hashing rule for obtaining the N first hash buckets by hash partitioning the first data table according to row identifiers, so it can be ensured that the data corresponding to the same row identifier in the second data table and the first data table is mapped to the second hash bucket and the first hash bucket with the same sequence number, respectively, so that the generated first Merkle tree has the same structure as the second Merkle tree. Therefore, whether the first data table and the second data table are consistent can be determined directly by comparing the hash values of the root nodes of the two Merkle trees. The data verification method provided by the present application can thus meet the requirements of accurate and fast verification of massive data.
  • the first data table includes M rows, where M is a positive integer greater than or equal to 1, and the first data table in the first database is processed to generate the first Merkle tree, including:
  • Hash processing is performed on the first data table to obtain M first hash groups, the M first hash groups are in one-to-one correspondence with the M rows, each of the first hash groups includes one row identifier in the first data table and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the first hash groups are all different;
  • the first Merkle tree is generated according to the hash values of the N first leaf nodes.
  • the method for generating the second Merkle tree is the same as the above method. Specifically, when the second data table includes K rows, K is a positive integer less than or equal to M, and the second data table in the second database is processed to generate a second Merkle tree, including:
  • Hash processing is performed on the second data table to obtain K second hash groups, the K second hash groups are in one-to-one correspondence with the K rows, each of the second hash groups includes one row identifier in the second data table and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the second hash groups are all different;
  • the second Merkle tree is generated according to the hash values of the N second leaf nodes.
  • when K is equal to M, it can be understood that the number of rows included in the second data table is the same as the number of rows included in the first data table, that is, no data was lost in the process of synchronizing or migrating the first data table to the second database.
  • when K is less than M, it can be understood that the number of rows included in the second data table is less than the number of rows included in the first data table, that is, data was lost in the process of synchronizing or migrating the first data table to the second database.
  • the mapping rule for mapping the K second hash groups to the N second hash buckets is the same as the mapping rule for mapping the M first hash groups to the N first hash buckets.
  • a data partition algorithm is used to map the row data in the data table to hash buckets, and the hash buckets correspond to the Merkle tree leaf nodes one-to-one. Since the mapping rule for mapping the K second hash groups to the N second hash buckets is the same as the mapping rule for mapping the M first hash groups to the N first hash buckets, the generated first Merkle tree and second Merkle tree have the same structure.
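  • The hash groups and the bucket mapping described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: SHA-256, the `repr()` serialization of row data, the modulo mapping rule, and the names `row_hash`, `bucket_index` and `partition` are all assumptions made for the example.

```python
import hashlib

def row_hash(row_data: tuple) -> bytes:
    """Hash the row data; repr() is an assumed, simplistic serialization."""
    return hashlib.sha256(repr(row_data).encode("utf-8")).digest()

def bucket_index(row_id, n_buckets: int) -> int:
    """Map a row identifier to a bucket number with one fixed rule (assumed: hash mod N)."""
    digest = hashlib.sha256(str(row_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def partition(table: dict, n_buckets: int) -> list:
    """Build one hash group (row_id, row_hash) per row and place it in its bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for row_id, row_data in table.items():
        group = (row_id, row_hash(row_data))
        buckets[bucket_index(row_id, n_buckets)].append(group)
    return buckets

# The same rows, stored in a different order on the target side.
source = {1: ("alice", 30), 2: ("bob", 41), 3: ("carol", 27)}
target = {3: ("carol", 27), 1: ("alice", 30), 2: ("bob", 41)}

src_buckets = partition(source, 4)
dst_buckets = partition(target, 4)

# Because both sides use the same rule, each row identifier lands in the
# bucket with the same sequence number in both tables.
for i in range(4):
    assert {rid for rid, _ in src_buckets[i]} == {rid for rid, _ in dst_buckets[i]}
```

  • Since the bucket number depends only on the row identifier, the two sides can be partitioned independently and still produce trees with the same shape.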
  • the hash value of each of the first leaf nodes is obtained by performing an XOR operation on the hash values included in the first hash groups included in the corresponding first hash bucket.
  • the hash value of each second leaf node is obtained by performing an XOR operation on the hash values included in the second hash groups included in the corresponding second hash bucket.
  • the hash value of the leaf node is obtained by performing the XOR operation on the data in the hash bucket, so when the consistency check is performed on the first data table and the second data table, sorting of the row data can be avoided, so that the efficiency of data verification can be further improved.
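  • The reason XOR removes the need for sorting is that it is commutative and associative, so the result does not depend on the order in which rows are read. A small sketch, assuming SHA-256 row hashes and an all-zero value for an empty bucket (both assumptions; `leaf_hash` and `xor_bytes` are illustrative names):

```python
import hashlib
from functools import reduce

def row_hash(row_data: str) -> bytes:
    return hashlib.sha256(row_data.encode("utf-8")).digest()

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def leaf_hash(bucket: list) -> bytes:
    """XOR all row hashes in a bucket; an empty bucket yields 32 zero bytes (assumed)."""
    return reduce(xor_bytes, bucket, bytes(32))

rows = ["alice,30", "bob,41", "carol,27"]
in_order = [row_hash(r) for r in rows]
reversed_order = [row_hash(r) for r in reversed(rows)]

# Same rows in a different order give the same leaf hash, so no sort is needed.
assert leaf_hash(in_order) == leaf_hash(reversed_order)

# Changing one row changes the leaf hash.
changed = [row_hash(r) for r in ["alice,30", "bob,42", "carol,27"]]
assert leaf_hash(in_order) != leaf_hash(changed)
```

  • One property worth keeping in mind with this sketch: XOR cancels equal values, so two rows with identical row data in the same bucket would cancel each other if only the row data were hashed.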
  • the first Merkle tree is compared with the second Merkle tree to determine whether the first data table is consistent with the second data table, including:
  • when the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, it is determined that the first data table is inconsistent with the second data table.
  • since the structures of the first Merkle tree and the second Merkle tree are exactly the same, when the consistency check is performed, it is possible to accurately and quickly determine whether the first data table and the second data table are consistent by judging whether the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree. Specifically, when the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, it may be determined that the first data table and the second data table are consistent. When the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, it may be determined that the first data table is inconsistent with the second data table.
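  • The root comparison can be illustrated with a short sketch. It assumes the number of leaves is a power of two and that a parent is the SHA-256 hash of its two children concatenated; the placeholder leaf values and the names `merkle_root` and `leaves_from` are made up for the example.

```python
import hashlib

def merkle_root(leaves: list) -> bytes:
    """Combine leaf hashes pairwise until a single root hash remains.
    Assumes len(leaves) is a power of two, so every parent has two children."""
    level = list(leaves)
    while len(level) > 1:
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def leaves_from(values: list) -> list:
    """Stand-in for real leaf hashes: hash a placeholder string per bucket."""
    return [hashlib.sha256(v.encode("utf-8")).digest() for v in values]

first = leaves_from(["bucket0", "bucket1", "bucket2", "bucket3"])
second_same = leaves_from(["bucket0", "bucket1", "bucket2", "bucket3"])
second_diff = leaves_from(["bucket0", "bucket1", "bucket2-changed", "bucket3"])

# Equal roots: the tables are treated as consistent.
assert merkle_root(first) == merkle_root(second_same)
# A difference in any single leaf propagates up to the root.
assert merkle_root(first) != merkle_root(second_diff)
```

  • Comparing two fixed-size root hashes costs the same regardless of how many rows the tables hold, which is what makes the first check fast.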
  • the method further includes:
  • it is determined that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, where the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1 ≤ i ≤ N;
  • the row data corresponding to the inconsistent row IDs are queried from the first database and the second database respectively according to the inconsistent row IDs.
  • the row data corresponding to the inconsistent row IDs can be queried from the first database and the second database according to the determined inconsistent row IDs. Since the size of the data set of the leaf nodes is controllable, the time required for determining the inconsistent row identifiers above is also controllable.
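  • A possible way to narrow an inconsistency down to row identifiers is sketched below. Buckets are modeled as dictionaries from row identifier to row hash, and short strings stand in for real hash values; both are assumptions for illustration. Note that the code uses 0-based positions, while the text numbers leaf nodes from 1.

```python
def differing_leaves(first_leaves: list, second_leaves: list) -> list:
    """Return the 0-based positions at which the two leaf hash lists differ."""
    return [i for i, (a, b) in enumerate(zip(first_leaves, second_leaves)) if a != b]

def inconsistent_row_ids(first_bucket: dict, second_bucket: dict) -> set:
    """Return row ids that are missing on one side or whose row hashes differ.
    Each bucket is modeled as {row_id: row_hash} (an assumption)."""
    all_ids = set(first_bucket) | set(second_bucket)
    return {rid for rid in all_ids if first_bucket.get(rid) != second_bucket.get(rid)}

# Toy data: short strings stand in for real hash values.
first_buckets = [{1: "h1", 5: "h5"}, {2: "h2", 6: "h6"}]
second_buckets = [{1: "h1", 5: "h5"}, {2: "h2-changed"}]  # row 2 changed, row 6 lost
first_leaves = ["leaf-a", "leaf-b"]
second_leaves = ["leaf-a", "leaf-b-changed"]

suspects = set()
for i in differing_leaves(first_leaves, second_leaves):
    suspects |= inconsistent_row_ids(first_buckets[i], second_buckets[i])

print(sorted(suspects))  # prints [2, 6]
```

  • Only the buckets behind differing leaves are examined, so the work is bounded by the bucket size rather than by the table size. The resulting identifiers can then be used to query the row data from both databases.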
  • the first data table includes the full amount of data in at least one data table in the first database.
  • the first data table includes incremental data in at least one data table in the first database.
  • the full data and the incremental data can be separated, and data verification can be performed as two stages, which can save computational overhead.
  • a Merkle tree with a higher number of layers can be constructed when verifying the full amount of data.
  • a Merkle tree with a lower number of layers can be constructed when verifying the incremental data.
  • the height of the first Merkle tree is associated with the first data table.
  • the height of the first Merkle tree can be adaptively adjusted according to the size of the first data table to be verified.
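  • The text does not state how the height is derived from the table size. One hypothetical rule is to pick the smallest height whose leaf count keeps the average bucket below a target size; the function `tree_height` and the threshold of 1000 rows per bucket below are purely illustrative assumptions.

```python
def tree_height(row_count: int, rows_per_bucket: int = 1000) -> int:
    """Return a height h (levels above the leaves) such that 2**h leaf buckets
    hold about rows_per_bucket rows each. At least 2 buckets are always used."""
    buckets_needed = max(2, -(-row_count // rows_per_bucket))  # ceiling division
    # Smallest h with 2**h >= buckets_needed, computed without floating point.
    return (buckets_needed - 1).bit_length()

for rows in (500, 4_000, 1_000_000):
    h = tree_height(rows)
    print(rows, "rows -> height", h, "->", 2 ** h, "leaf buckets")
```

  • Under this rule a full-data check of a large table gets a taller tree with many small buckets, while an incremental check over a few rows gets a short one.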
  • the first database and the second database are heterogeneous databases or homogeneous databases.
  • the method for data verification provided by the present application can be applied to the data consistency verification of homogeneous databases and the data consistency verification of heterogeneous databases.
  • the first database is a relational database or a non-relational database
  • the second database is a relational database or a non-relational database
  • in a second aspect, a data verification apparatus is provided, which executes the method in the first aspect and any possible implementation manner of the first aspect.
  • the data verification device provided in the present application is decoupled from and independent of the database system, so the data verification device will not have intrusive effects on the database system, such as affecting the function and performance of the database system or occupying database system resources.
  • in a third aspect, a data verification device is provided, which includes a memory and a processor, the memory is used for storing instructions, and the processor is configured to read the instructions stored in the memory, so that the data verification device executes the method in the above-mentioned first aspect and any possible implementation of the first aspect.
  • a processor including: an input circuit, an output circuit, and a processing circuit.
  • the processing circuit is configured to receive a signal through the input circuit and transmit a signal through the output circuit, so that the method in the first aspect and any possible implementation of the first aspect is implemented.
  • the above-mentioned processor may be a chip
  • the input circuit may be an input pin
  • the output circuit may be an output pin
  • the processing circuit may be a transistor, a gate circuit, a flip-flop, and various logic circuits.
  • the input signal received by the input circuit may be received and input by, for example, but not limited to, a receiver
  • the signal output by the output circuit may be, for example, but not limited to, output to and transmitted by a transmitter
  • the input circuit and the output circuit can be the same circuit, which acts as an input circuit and an output circuit at different times.
  • the embodiments of the present application do not limit the specific implementation manners of the processor and various circuits.
  • a processing apparatus including a processor and a memory.
  • the processor is configured to read the instructions stored in the memory, and can receive signals through the receiver and transmit signals through the transmitter, so as to execute the first aspect and the method in any possible implementation manner of the first aspect.
  • there are one or more processors and one or more memories.
  • the memory may be integrated with the processor, or the memory may be provided separately from the processor.
  • the memory can be a non-transitory memory, such as a read-only memory (ROM), which can be integrated with the processor on the same chip, or can be set separately on a different chip. The embodiments of the present application do not limit the type of the memory or the manner in which the memory and the processor are arranged.
  • the relevant data interaction process such as sending indication information, may be a process of outputting indication information from the processor, and receiving capability information may be a process of receiving input capability information by the processor.
  • the data output by the processor can be output to the transmitter, and the input data received by the processor can be from the receiver.
  • the transmitter and the receiver may be collectively referred to as a transceiver.
  • a computer-readable storage medium for storing a computer program, the computer program comprising instructions for executing the method in the above-mentioned first aspect and any possible implementation manner of the above-mentioned first aspect.
  • a computer program product comprising instructions that, when run on a computer, cause the computer to execute the method in the above-mentioned first aspect and any possible implementation manner of the above-mentioned first aspect.
  • a system including the data verification apparatus described in the second aspect.
  • a chip including at least one processor and an interface; the at least one processor is used to call and run a computer program, so that the chip executes the method in the above-mentioned first aspect and any possible implementation of the above-mentioned first aspect.
  • FIG. 1 is a schematic diagram of a system 100 suitable for the data verification method provided by the present application.
  • FIG. 2 is a schematic diagram of the data verification apparatus 130 provided by the present application.
  • FIG. 3 is a schematic flowchart of the data verification method 100 provided by the present application.
  • FIG. 4 is a schematic diagram of a Merkle tree determined according to the method provided in this application.
  • FIG. 5 is a schematic diagram of extracting data from the first data table provided by the present application.
  • FIG. 6 is a schematic diagram of hash partitioning the data extracted from the first data table provided by the present application.
  • FIG. 7 is a schematic diagram of a Merkle tree determined according to the method provided in the present application.
  • FIG. 8 is a schematic flowchart of a data verification method 200 provided by the present application.
  • FIG. 9 is a schematic diagram of a Merkle tree determined according to the method provided in the present application.
  • FIG. 10 is a schematic structural diagram of a data verification apparatus 1000 provided by the present application.
  • FIG. 11 is a schematic structural diagram of a data verification device 1000 provided by the present application.
  • FIG. 12 is a schematic structural diagram of a system 1200 provided by the present application.
  • the network architecture and service scenarios described in the embodiments of the present application are for the purpose of illustrating the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application.
  • with the evolution of the network architecture and the emergence of new service scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.
  • the terms "including", "comprising", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.
  • "At least one" means one or more, and "plurality" means two or more.
  • the character “/” generally indicates that the associated objects are an “or” relationship.
  • “At least one item(s) below” or similar expressions thereof refer to any combination of these items, including any combination of single item(s) or plural items(s).
  • at least one item of a, b, or c can represent: a; b; c; a and b; a and c; b and c; or a, b and c, where each of a, b and c can be single or multiple.
  • Data verification is a verification operation to ensure the integrity of the data.
  • a check value is usually calculated on the original data by a specified algorithm.
  • the receiver uses the same algorithm to calculate the check value again. If the check values obtained by the two calculations are the same, it means that the data is consistent.
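  • As a concrete illustration of this check-value comparison, the sketch below uses SHA-256 as the "specified algorithm"; that choice and the sample payloads are assumptions, and any algorithm agreed on by both sides would serve.

```python
import hashlib

def check_value(data: bytes) -> str:
    """Compute a check value over the data (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()

original = b"id=1,name=alice,balance=100"
sent_value = check_value(original)          # calculated on the original data

received_ok = b"id=1,name=alice,balance=100"
received_bad = b"id=1,name=alice,balance=900"

# The receiver recomputes the value with the same algorithm and compares.
assert check_value(received_ok) == sent_value    # data is consistent
assert check_value(received_bad) != sent_value   # data was altered
```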
  • Data replication, the technique of copying data from one location to another, involves sharing information to ensure consistency between redundant resources (such as software or hardware components) in order to improve reliability, fault tolerance or accessibility.
  • A Merkle tree can also be called a hash tree.
  • a Merkle tree is a binary tree consisting of a root node, a set of intermediate nodes and a set of leaf nodes. The bottommost leaf node contains the stored data or its hash value, each intermediate node is the hash value of the content of its two child nodes, and the root node is also composed of the hash value of the content of its two child nodes.
  • Hash is a function that maps data of arbitrary length into data of fixed length. A slight change in the input data can change the result of the hash operation beyond recognition, and it is generally considered infeasible to infer the characteristics of the original input data from the hash value.
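  • The two properties mentioned here, fixed output length and sensitivity to small input changes, can be observed directly. The sketch uses SHA-256 from the Python standard library as an example hash function; the inputs are arbitrary.

```python
import hashlib

def digest_as_int(data: bytes) -> int:
    """Interpret the SHA-256 digest of the data as one large integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

# Fixed length: 1 byte and 10,000 bytes of input both give a 32-byte digest.
short_digest = hashlib.sha256(b"a").digest()
long_digest = hashlib.sha256(b"a" * 10_000).digest()
assert len(short_digest) == len(long_digest) == 32

# Sensitivity: inputs differing in one character give digests that differ
# in many bit positions (the exact count depends on the inputs).
differing_bits = bin(digest_as_int(b"row-1") ^ digest_as_int(b"row-2")).count("1")
print("bits that differ out of 256:", differing_bits)
```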
  • A heterogeneous database (HDB) is a collection of multiple related database systems, which can realize data sharing and transparent access.
  • Each database system already exists before joining the heterogeneous database system, and has its own database management system (DBMS).
  • Each component of a heterogeneous database has its own autonomy. While realizing data sharing, each database system still maintains its own application characteristics, integrity control and security control.
  • Homogeneous database means that all sites use a common DBMS software, and each site understands each other and cooperates to deal with the needs of users.
  • A relational database (RD) is a database that uses a relational model to organize data. It stores data in the form of rows and columns so that users can understand it.
  • the series of rows and columns of a relational database are called tables, and a group of tables constitutes a database.
  • a user retrieves data in a database through a query, which is executable code that selects certain areas of the database.
  • the relational model can be simply understood as a two-dimensional table model, and a relational database is a data organization composed of two-dimensional tables and the relationships between them.
  • A non-relational database (NoSQL) is a database that uses a non-relational model to organize data.
  • Non-relational databases can include the following types: key-value store databases (e.g., Oracle BDB), column store databases (e.g., HBase), document databases (e.g., CouchDB or MongoDB), and graph databases.
  • the offline data verification method is usually adopted: the data generated by the operation of the application software is obtained from the production database (i.e., the source database) of each independent data source and loaded into a unified offline database (i.e., the target database), and the data is then checked to determine whether the data of each independent data source is consistent.
  • the frequency of acquiring data needs to be reduced, even when the business volume is small, which in turn will affect offline data.
  • the data usually needs to be sorted, and the data sorting process often needs to occupy a large amount of system resources. Therefore, using the above-mentioned offline data verification method, when verifying massive data (for example, TB-level data), it is usually impossible to meet business requirements.
  • the present application provides a data verification method, device and system, which can better meet the requirements for accurate and rapid verification of massive data.
  • FIG. 1 is a schematic diagram of a system 100 suitable for the data verification method provided by the present application.
  • the system 100 can be used in but not limited to the following scenarios: database data migration scenario or data synchronization scenario.
  • the system 100 may include at least one source database 110, at least one target database 120 and at least one data verification device 130.
  • the data verification device 130 is a system on a third-party hardware device independent of the source database 110 and the target database 120, the source database 110 is the database before data migration or replication, and the target database 120 is the database after data migration or replication.
  • the type of the above-mentioned source database 110 and the type of the above-mentioned target database 120 are not specifically limited.
  • the above-mentioned source database 110 or the above-mentioned target database 120 may be a relational database.
  • the source database 110 or the target database 120 may be any one of the following relational databases: Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL. It should be understood that the type of relational database here is only illustrative and does not constitute any limitation to the system 100.
  • the relational database may also be other types of relational databases other than those listed above.
  • the above-mentioned source-side database 110 or the above-mentioned target-side database 120 may be a non-relational database.
  • the above-mentioned source database 110 or the above-mentioned target database 120 may be any one of the following non-relational databases: NoSQL, Cloudant and MongoDB. It should be understood that the type of the non-relational database here is only illustrative and does not constitute any limitation to the system 100.
  • the non-relational database may also be other types of non-relational databases other than those listed above.
  • the source database 110 may be a non-relational database
  • the target database 120 may be a relational database
  • the source database 110 may be a NoSQL database
  • the target database 120 may be an Oracle database.
  • the source database 110 and the target database 120 may be homogeneous databases.
  • the source database 110 and the target database 120 may also be heterogeneous databases, which are not limited.
  • the deployment of the above-mentioned source database 110, the above-mentioned target database 120 and the above-mentioned data verification device 130 on devices is not specifically limited, as long as the data verification device 130 is a system independent of the source database 110 and the target database 120.
  • the source database 110 may be a physical module or a virtual module deployed on physical device #1
  • the target database 120 may be a physical module or virtual module deployed on physical device #2
  • the data verification apparatus 130 may be a physical module or a virtual module deployed on the physical device #3, and the physical device #1, the physical device #2, and the physical device #3 are different devices.
  • the source database 110 and the target database 120 may be different physical modules or virtual modules deployed on physical device #1, the data verification apparatus 130 may be a physical module or virtual module deployed on physical device #2, and physical device #1 and physical device #2 are different devices.
  • the source database 110 and the target database 120 may interact (eg, data migration or data synchronization, etc.), and the source database 110 and the target database 120 may also interact with the data verification apparatus 130 respectively.
  • the data verification device 130 can extract the source-side data to be verified from the source database 110, extract the target-side data to be verified from the target database 120, and perform a consistency check on the two sets of extracted data to verify whether the data after data migration or data synchronization is consistent in the source database 110 and the target database 120.
  • when the data verification device 130 determines that the data to be verified extracted from the source database is inconsistent with the data to be verified extracted from the target database, it can further determine which data is inconsistent.
  • when the data verification apparatus 130 has a storage function, the result of the consistency verification may also be stored in the data verification apparatus 130.
  • FIG. 1 is for illustration only and does not constitute any limitation to the system to which the present application applies.
  • the system 100 may further include a larger number of source-end databases 110 and/or target-end databases 120 and/or data verification devices 130.
  • the data verification apparatus 130 may further include other modules, such as a verification execution module, a source-side data management module to be verified, a target-side data management module to be verified, and the like.
  • a schematic structural diagram of the data verification apparatus 130 in FIG. 1 is introduced below with reference to FIG. 2.
  • FIG. 2 is a schematic diagram of the data verification apparatus 130 provided by the present application.
  • the apparatus 130 may include: a source data extraction module 131, a source processing module 132, a target processing module 133, a target data extraction module 134, a comparison module 135 and a storage module 136.
  • the above-mentioned modules may be connected through internal connection paths.
  • the source-side processing module 132 may interact with the comparison module 135, the source-side data extraction module 131, and the target-side processing module 133.
  • the source-end data extraction module 131 is configured to obtain data from a source-end database (for example, the above-mentioned source-end database 110).
  • the source data extraction module 131 can obtain data from the source database 110 in FIG. 1 .
  • the source-end processing module 132 is configured to acquire data from the source-end data extraction module 131, and perform hash processing and data partition processing on the acquired data.
  • the target-end processing module 133 is configured to obtain data from the target-end data extraction module 134, and perform hash processing and data partition processing on the obtained data.
  • the target-end data extraction module 134 is configured to obtain data from a target-end database (e.g., the above-mentioned target-end database 120).
  • the target data extraction module 134 can obtain data from the target database 120 in FIG. 1 .
  • the comparison module 135 is configured to acquire the Merkle tree corresponding to the data from the source-end processing module 132 and the target-end processing module 133, and perform data consistency verification based on the acquired Merkle tree.
  • the comparison module 135 is the core module of the above-mentioned data verification device 130 .
  • the comparison module 135 may further include a data comparison sub-module and a data reverse check sub-module.
  • the data comparison sub-module can use the Merkle trees to quickly compare the data and find the set of inconsistent row identifiers, and the data reverse check sub-module can look up the detailed data in the databases according to the inconsistent row identifiers, and finally find the data values corresponding to the inconsistent row identifiers.
  • the storage module 136 is used to store data and instructions.
  • FIG. 2 is only for illustration and does not constitute any limitation to the data verification apparatus 130 provided in the present application.
  • the source processing module 132 and the target processing module 133 in the data verification apparatus 130 may also be included in the same processing module.
  • the source data extraction module 131 and the target data extraction module 134 in the data verification apparatus 130 may also be included in the same processing module.
  • when the comparison module 135 in the data verification apparatus 130 has the function of the storage module 136, the data verification apparatus 130 may also not include the storage module 136.
  • FIG. 3 is a schematic flowchart of the data verification method 100 provided by the present application.
  • the method 100 may include steps 110 to 130 . Steps 110 to 130 will be described in detail below.
  • the execution subject of steps 110 to 130 may be the data verification device 130 shown in FIG. 2 .
  • Step 110 Process the first data table in the first database to generate a first Merkle tree.
  • the first database can be understood as the production database, that is, the source database.
  • the first database may be the source database 110 shown in FIG. 1 .
  • the data source and data size included in the first data table are not limited.
  • the first data table may include the full amount of data in at least one data table in the first database.
  • the first data table may further include two or even more data tables in the first database.
  • the first data table can be understood as a data set composed of two or more data tables in the first database.
  • the first data table may include all data in data table #1.
  • the first data table may also include all the data in data table #1 and data table #3.
  • the first data table may also include all data in data table #1, data table #2, and data table #3.
  • the first data table may include incremental data in at least one data table in the first database. That is to say, the data verification method provided by the present application can also perform consistency verification only on the changed data in the data table.
  • data table #1 and data table #2 are consistent, wherein data table #2 is obtained by copying data table #1.
  • part of the data in data table #1 is changed (eg, data is updated, data is increased, or data is decreased, etc.).
  • the changed data of the above-mentioned data table #1 can be considered as the data included in the first data table.
  • each row of the first data table may include a row identifier and row data.
  • the first Merkle tree includes N first leaf nodes, and the N first leaf nodes are in one-to-one correspondence with N first hash buckets.
  • the hash value of each first leaf node is determined according to the corresponding first hash bucket.
  • the N first hash buckets are obtained by hash partitioning the first data table according to the row identifiers. Any two first hash buckets are different, and N is a positive integer greater than or equal to 2.
  • the types of row identifiers included in the first data table are not specifically limited.
  • the row ID may be a numeric row ID.
  • a numeric row ID can be "5".
  • the above row identifier may also be a string type row identifier.
  • the string-type row identifier can be "Zhang San” or “Li Si”, etc.
  • the string type row identifier needs to be processed to obtain a hash value corresponding to the string type row identifier.
  • the N first hash buckets are obtained by hash partitioning the first data table according to row identifiers, which can be understood as whether the row identifiers of the first data table are numeric row identifiers or string row identifiers.
  • the above-mentioned N first leaf nodes correspond to the N first hash buckets one-to-one. It can be understood that the i-th first leaf node (that is, the first leaf node with serial number i) among the N first leaf nodes corresponds to the i-th first hash bucket (that is, the first hash bucket with serial number i) among the N first hash buckets. That is to say, the first leaf node with serial number i corresponds to the first hash bucket with serial number i. The serial numbers of the N first leaf nodes are all different, the serial numbers of the N first hash buckets are all different, and i is a positive integer greater than or equal to 1 and less than or equal to N.
  • the above N first hash buckets are obtained by hash partitioning the first data table according to row identifiers, and the corresponding relationship between the first hash bucket and the first data table is not specifically limited in this application.
  • each first hash bucket is determined from a row in the first data table. At this time, each first hash bucket corresponds to a row of the first data table. In this case, the number of row identifiers in the first data table included in each first hash bucket is the same.
  • At least one of the N first hash buckets is determined from two or more rows in the first data table. At this time, at least one first hash bucket corresponds to two or more rows of the first data table. In this case, the number of row identifiers in the first data table included in each first hash bucket may be different.
  • At least one of the N first hash buckets may also be empty. For example, N-1 of the first hash buckets are determined according to all the row data included in the first data table, and the remaining first hash bucket does not include any data of the first data table.
  • that any two of the above first hash buckets are different can be understood as follows: the serial numbers of any two first hash buckets are different, and the row identifiers of the first data table included in any two first hash buckets are different.
  • for example, the serial number of hash bucket #1 is 1, and hash bucket #1 includes 2 row identifiers of the first data table, namely "5" and "6"; the serial number of hash bucket #2 is 2, and hash bucket #2 includes 1 row identifier of the first data table, namely "1". In this case, hash bucket #1 can be considered to be different from hash bucket #2.
  • the first data table may include M rows, where M is a positive integer greater than or equal to N.
  • the above-mentioned processing of the first data table in the first database to generate the first Merkle tree may include the following steps:
  • Hash the M rows to obtain M first hash groups, where the M first hash groups are in one-to-one correspondence with the M rows, each first hash group includes the row identifier of one of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers included in different first hash groups are different;
  • the number of first hash groups included in each of the N first hash buckets is not specifically limited.
  • the number of first hash groups included in each of the above N first hash buckets may be the same.
  • the number of first hash groups included in each of the above N first hash buckets may also be different.
  • the number of first hash groups included in some of the above N first hash buckets may be the same, while the number of first hash groups included in the remaining first hash buckets is different.
  • the hash value of each first leaf node may be a hash value obtained by performing an XOR operation on the hash values of the first hash groups included in the corresponding first hash bucket. It should be understood that when the first hash bucket corresponding to a first leaf node does not include any of the M first hash groups, the hash value of that first leaf node may be empty.
  • the above determination of the hash values of the N first leaf nodes according to the N first hash buckets may include:
  • the hash value of the at least one first leaf node is a hash value obtained by performing an XOR operation on the hash values included in the at least one of the M first hash groups included in the corresponding first hash bucket.
  • the first hash bucket corresponding to the at least one first leaf node may further include two or more first hash groups among the M first hash groups.
  • when the first hash bucket corresponding to at least one first leaf node does not include any of the M first hash groups, the hash value of the at least one first leaf node is equal to zero.
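  • the hashing and XOR steps above can be sketched as follows. This is a minimal illustration, not the implementation of the present application: the function names `row_hash`, `hash_rows` and `leaf_hash`, the use of SHA-256 truncated to 64 bits, and the choice of 0 for an empty bucket are assumptions made for the example.

```python
import hashlib
from functools import reduce


def row_hash(row_data: str) -> int:
    """Hash one row's data to a 64-bit integer (truncated SHA-256, an assumption)."""
    digest = hashlib.sha256(row_data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_rows(rows):
    """Turn (row_id, row_data) rows into hash groups (row_id, hash_of_row_data)."""
    return [(row_id, row_hash(row_data)) for row_id, row_data in rows]


def leaf_hash(bucket):
    """XOR the hash values of all hash groups in one bucket; an empty bucket gives 0."""
    return reduce(lambda acc, group: acc ^ group[1], bucket, 0)
```

  • because XOR is commutative and associative, the leaf hash value in this sketch does not depend on the order in which hash groups are placed into a bucket; for instance, `leaf_hash([(1, 0b011), (5, 0b101)])` evaluates to `0b110`.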
  • the height of the first Merkle tree described above is associated with the first data table. Before establishing the first Merkle tree, it is also necessary to determine the relevant parameters of the first Merkle tree according to the size of the first data table, for example, the number of leaf nodes included in the first Merkle tree and the tree height. The tree height of the first Merkle tree changes adaptively with the amount of data included in the first data table: the larger the amount of data included in the first data table, the higher the tree height of the first Merkle tree. In other words, the first Merkle tree built when the first data table includes a relatively large amount of data (for example, 1GB) is higher than the first Merkle tree built when the first data table includes a relatively small amount of data (for example, 100MB).
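  • one possible way to derive such parameters is sketched below. The present application does not give a formula, so everything here is an assumption for illustration: the function name `tree_parameters`, the target of 1000 rows per hash bucket, and the rounding of the leaf count up to a power of two so that the tree is a complete binary tree.

```python
import math


def tree_parameters(row_count: int, rows_per_bucket: int = 1000):
    """Return (number_of_leaves, tree_height) for a table with row_count rows.

    Assumption: aim for roughly rows_per_bucket rows in each hash bucket,
    use at least 2 leaves, and round the leaf count up to a power of two.
    """
    buckets_needed = max(2, math.ceil(row_count / rows_per_bucket))
    num_leaves = 1 << (buckets_needed - 1).bit_length()
    tree_height = num_leaves.bit_length()  # equals log2(num_leaves) + 1
    return num_leaves, tree_height
```

  • with these assumed defaults, a table of 4000 rows gives 4 leaf nodes and a tree height of 3, while a table of 100000 rows gives 128 leaf nodes and a tree height of 8, so the tree grows with the table.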
  • taking a Merkle tree (i.e., an example of the above-mentioned first Merkle tree) as an example, the tree height of the Merkle tree is 3, the number of intermediate nodes is 2, and the number of leaf nodes is 4.
  • the topmost layer is the root node, the second top layer is the intermediate node, the next layer is the leaf node, and the bottommost layer is the hash bucket described above (ie, an example of the first hash bucket above).
  • the four leaf nodes of the Merkle tree may be respectively marked as: leaf node 1, leaf node 2, leaf node 3 and leaf node 4.
  • the four hash buckets of the Merkle tree can be marked as: hash bucket 1, hash bucket 2, hash bucket 3, and hash bucket 4.
  • the four leaf nodes of the Merkle tree correspond to the four hash buckets one-to-one. Specifically, leaf node 1 corresponds to hash bucket 1, leaf node 2 corresponds to hash bucket 2, leaf node 3 corresponds to hash bucket 3, and leaf node 4 corresponds to hash bucket 4.
  • the hash value of each leaf node of the Merkle tree is obtained by performing an XOR operation on the hash value included in the hash bucket corresponding to each leaf node.
  • the hash value of each intermediate node of the Merkle tree is obtained by hashing the hash values of its two child nodes.
  • H(N0, N1) represents the result of performing a hash operation on the hash values of the two leaf nodes of this intermediate node (i.e., N0 and N1).
  • the hash value of the root node of the Merkle tree is obtained by hashing the hash values of its two child nodes.
  • H(N4,N5) represents the hash value of the root node of the Merkle tree.
  • the i-th leaf node #1 above can be understood as the leaf node #1 with the serial number i
  • FIG. 4 is for illustration only and does not constitute any limitation to the present application.
  • the Merkle tree shown in FIG. 4 may also include a greater number of leaf nodes.
  • the Merkle tree shown in FIG. 4 may also have a higher tree height.
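  • the bottom-up construction described for FIG. 4 can be sketched as follows. This is a hedged example only: `combine` and `build_merkle_tree` are hypothetical names, SHA-256 truncated to 64 bits stands in for the unspecified hash operation H, and the number of leaf nodes is assumed to be a power of two.

```python
import hashlib


def combine(left: int, right: int) -> int:
    """H(left, right): hash two 64-bit child hash values into a parent hash value."""
    data = left.to_bytes(8, "big") + right.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def build_merkle_tree(leaf_hashes):
    """Return the tree as a list of levels, from the leaf level up to the root level.

    Assumes len(leaf_hashes) is a power of two.
    """
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        parents = [combine(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)]
        levels.append(parents)
    return levels
```

  • for 4 leaf hash values this yields three levels (4 leaf nodes, 2 intermediate nodes, 1 root node), matching the tree height of 3 in the example above.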
  • before step 110, the following operation may also be included: acquiring the first data table from the first database.
  • FIG. 5 and FIG. 6 are for illustration only, and do not constitute any limitation to the method for obtaining the first data table in the present application.
  • FIG. 5 is a schematic diagram of extracting data from the first data table provided by the present application. It should be understood that FIG. 5 is only an example. For example, a greater number (eg, 100 rows) or a lesser number (eg, 4 rows) of row data may also be included in data table #1. For example, the data extraction module may also include a higher number of threads.
  • the execution body for extracting data from the first data table may be a data extraction module.
  • the data extraction module may be the source-end data extraction module 131 and the target-end data extraction module 134 shown in FIG. 2 . That is, the source-side data extraction module 131 and the target-side data extraction module 134 in FIG. 2 have the data extraction function described below.
  • extracting data from the first data table may include, but is not limited to, the following steps:
  • the processed S batches of data are enqueued into S queues, and the S batches of data are in one-to-one correspondence with the S queues.
  • the number of the above processing threads may be set according to the size of the first data table. For example, when the first data table is larger, a larger number of processing threads may be set. For example, when the first data table is smaller, a smaller number of processing threads may be provided.
  • row data can be extracted from the first data table in batches, each batch of data can be processed by a separate thread, these threads can be executed in parallel, and the extracted data is put into the corresponding data queue .
  • the same processing thread can also be used to process the data in the data table.
  • the above-mentioned data extraction module may be the source-side data extraction module 131 in FIG. 2
  • the above-mentioned data extraction module may be the target-side data extraction module 134 in FIG. 2 . That is to say, the source-side data extraction module 131 and the target-side data extraction module 134 in FIG. 2 have the functions of the above-mentioned data extraction modules.
  • the data extraction module may include two threads responsible for extracting data, namely thread #1 and thread #2, thread #1 may be responsible for extracting the first batch of data, and thread #2 may be responsible for extracting the second batch of data. Thread #1 puts the extracted data into queue #1, and thread #2 puts the extracted data into queue #2.
  • the data extraction threads (that is, the above-mentioned thread #1 and thread #2) can be executed in parallel to improve the extraction efficiency.
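  • the batch extraction with one thread and one queue per batch can be sketched as follows. This is an illustration under stated assumptions: an in-memory list of rows stands in for the database table, and the function name `extract_batches` and the contiguous batching scheme are hypothetical.

```python
import queue
import threading


def extract_batches(rows, num_threads):
    """Split rows into num_threads batches; each thread puts its batch into its own queue."""
    queues = [queue.Queue() for _ in range(num_threads)]
    batch_size = -(-len(rows) // num_threads)  # ceiling division

    def worker(index):
        # Each thread handles one contiguous batch of rows.
        start = index * batch_size
        for row in rows[start:start + batch_size]:
            queues[index].put(row)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return queues
```

  • with 8 rows and 2 threads, the first queue receives rows 1 to 4 and the second queue receives rows 5 to 8, mirroring thread #1 with queue #1 and thread #2 with queue #2.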
  • FIG. 6 is a schematic diagram of hash partitioning data extracted from the first data table according to row identifiers provided by the present application.
  • the execution subject for hash partitioning the data extracted from the data table #1 (ie, an example of the first data table) according to row identifiers may be a data processing module.
  • the data processing module may be the source-end processing module 132 and the target-end data processing module 133 shown in FIG. 2 . That is, the source-side processing module 132 and the target-side data processing module 133 in FIG. 2 have the hash partitioning function described below.
  • data table #1 includes 8 pieces of data, and the row identifiers corresponding to these 8 pieces of data are 1, 2, 3, . . . , 8 respectively.
  • each row in the hash data queue #1 may be recorded as one hash group (ie, an example of the above-mentioned first hash group).
  • the row ID of the first hash group in hash data queue #1 is 1, and the stored hash value is 0xffe898.
  • the row ID of the 5th hash group in hash data queue #1 is 5, and the stored hash value is 0xb8bdd.
  • for the remaining hash groups, refer to FIG. 6; they are not listed one by one here.
  • a hash operation may also be performed on the row identifier and row data of each row of data to obtain a hash value corresponding to the row identifier and a hash value corresponding to the row data.
  • the hash data queue #1 includes 8 hash groups. The row identifier 1 of the first hash group modulo 4 is 1, and the row identifier 5 of the fifth hash group modulo 4 is also 1, so the first hash group and the fifth hash group can be placed in the first hash bucket #1 (that is, the hash bucket with serial number 1). That is to say, the row data with row IDs 1 and 5 in data table #1 is mapped into the hash bucket #1 with serial number 1.
  • the same processing can be performed on the other hash groups in the hash data queue #1: the 2nd hash group and the 6th hash group are mapped to the 2nd hash bucket #1, the 3rd hash group and the 7th hash group are mapped to the 3rd hash bucket #1, and the 4th hash group and the 8th hash group are mapped to the 4th hash bucket #1.
  • FIG. 6 is only an example.
  • a greater number (eg, 8) or a lesser number (eg, 2) of hash bucket #1 may also be included.
  • a greater number (eg, 100 rows) or a lesser number (eg, 4 rows) of row data may also be included in data table #1.
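  • the modulo-based hash partitioning of FIG. 6 can be sketched as follows. The names `bucket_number` and `partition` are hypothetical, and mapping a remainder of 0 to the last bucket (so that row identifiers 4 and 8 land in the 4th hash bucket, as in the example above) is an assumption about how the serial numbers are assigned.

```python
def bucket_number(row_id: int, num_buckets: int) -> int:
    """Map a numeric row identifier to a bucket serial number in 1..num_buckets."""
    return (row_id - 1) % num_buckets + 1


def partition(hash_groups, num_buckets):
    """Distribute hash groups (row_id, hash_value) into buckets keyed by serial number."""
    buckets = {number: [] for number in range(1, num_buckets + 1)}
    for row_id, hash_value in hash_groups:
        buckets[bucket_number(row_id, num_buckets)].append((row_id, hash_value))
    return buckets
```

  • since the bucket depends only on the row identifier, applying the same function to the source table and the target table places rows with the same row identifier in buckets with the same serial number.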
  • Step 120 Process the second data table in the second database to generate a second Merkle tree.
  • the second database can be understood as the target database.
  • the second database may be the target database 120 shown in FIG. 1 .
  • the second data table is obtained by synchronizing or migrating the first data table to the second database.
  • the second Merkle tree includes N second leaf nodes, and the N second leaf nodes are in one-to-one correspondence with N second hash buckets.
  • the hash value of each second leaf node is determined according to the corresponding second hash bucket.
  • the N second hash buckets are obtained by hash partitioning the second data table according to row identifiers.
  • the hash rule for hash partitioning the second data table according to row identifiers to obtain the N second hash buckets is the same as the hash rule for hash partitioning the first data table according to row identifiers to obtain the N first hash buckets, and any two second hash buckets are different.
  • the second Merkle tree includes N second leaf nodes, and the first Merkle tree includes N first leaf nodes. Since the two Merkle trees include the same number of leaf nodes, it can be considered that the first Merkle tree and the second Merkle tree have the same tree height. That is to say, the first Merkle tree and the second Merkle tree provided by the present application have the same tree height.
  • the above-mentioned N second leaf nodes are in one-to-one correspondence with the N second hash buckets. It can be understood that the i-th second leaf node (that is, the second leaf node with serial number i) corresponds to the i-th second hash bucket (that is, the second hash bucket with serial number i). The serial numbers of the N second leaf nodes are all different, the serial numbers of the N second hash buckets are all different, and i is a positive integer greater than or equal to 1 and less than or equal to N.
  • the hash value of each second leaf node is determined according to the corresponding second hash bucket.
  • for a specific determination method, refer to the method for determining the hash value of the first leaf node according to the corresponding first hash bucket in step 110.
  • N second hash buckets are obtained by hash partitioning the second data table according to row identifiers, and the corresponding relationship between the second hash bucket and the second data table is not specifically limited in this application.
  • each second hash bucket is determined from a row in the second data table.
  • At least one of the N second hash buckets is determined from two or more rows in the second data table.
  • At least one second hash bucket among the above N second hash buckets may also be empty.
  • that the N second leaf nodes correspond to the N second hash buckets one-to-one can be understood as follows: the i-th second leaf node among the N second leaf nodes corresponds to the i-th second hash bucket among the N second hash buckets.
  • i is a positive integer greater than or equal to 1 and less than or equal to N. It should also be understood that the serial numbers of the N second leaf nodes are all different.
  • the above-mentioned second data table is obtained by synchronizing or migrating the first data table to the second database, and may include the following situations:
  • the number of rows included in the second data table is the same as the number of rows included in the first data table.
  • the first data table includes 10 rows, and each row includes row identifiers and row data.
  • the second data table obtained after synchronization or migration also includes 10 rows.
  • the number of rows included in the second data table is smaller than the number of rows included in the first data table.
  • the first data table includes 10 rows, and each row includes a row identifier and row data.
  • the second data table obtained after synchronization or migration includes 9 rows.
  • the number of rows included in the second data table in this application may be the same as the number of rows included in the first data table, or the number of rows included in the second data table may also be smaller than the number of rows included in the first data table.
  • the above hash rule for hash partitioning the second data table according to row identifiers to obtain N second hash buckets is the same as the hash rule for hash partitioning the first data table according to row identifiers to obtain N first hash buckets. It can be understood that, since the two data tables are partitioned with the same hash rule, the data corresponding to the same row identifier in the second data table and in the first data table is guaranteed to be mapped to hash buckets with the same serial number.
  • that any two second hash buckets are different can be understood as follows: the serial numbers of any two second hash buckets are different, and the row identifiers included in any two non-empty second hash buckets are different.
  • the second data table may include K rows, where K is a positive integer less than or equal to M.
  • processing the second data table in the second database to generate the second Merkle tree may include:
  • Hash the K rows to obtain K second hash groups, where the K second hash groups are in one-to-one correspondence with the K rows, each second hash group includes the row identifier of one of the K rows and the hash value of the row data corresponding to that row identifier, and the row identifiers included in different second hash groups are different;
  • a second Merkle tree is generated according to the hash values of the N second leaf nodes.
  • when K is equal to M, it can be understood that the number of rows included in the second data table is the same as the number of rows included in the first data table, that is, no data is missing in the process of synchronizing or migrating the first data table to the second database.
  • when K is less than M, it can be understood that the number of rows included in the second data table is smaller than the number of rows included in the first data table, that is, data is missing in the process of synchronizing or migrating the first data table to the second database.
  • the mapping rule for mapping the K second hash groups to the N second hash buckets is the same as the mapping rule for mapping the M first hash groups to the N first hash buckets; that is, the hash rule for obtaining N second hash buckets by hash partitioning the second data table according to row identifiers is the same as the hash rule for obtaining N first hash buckets by hash partitioning the first data table according to row identifiers.
  • for example, the first hash group with sequence number 1 includes the row identifier 1 and the hash value of the corresponding row data, and the first hash group with sequence number 1 is mapped to the first hash bucket with sequence number 5.
  • if the second data table includes a row with a row ID of 1, the row with the row ID of 1 corresponds to the second hash group with sequence number 1, and the second hash group with sequence number 1 is mapped to the second hash bucket with sequence number 5.
  • the hash value of the second leaf node is also empty.
  • before step 120, the method may also include acquiring the second data table from the second database.
  • the present application does not specifically limit the manner of acquiring the second data table.
  • a check mark can be written into the first data table of the first database; in the process of replicating data from the first database to the second database, when the above check mark is detected in the second database, the data carrying the check mark is used as the data in the second data table.
  • for content not described in detail in step 120, refer to the description of the foregoing step 110, which will not be repeated here.
  • Step 130 Compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
  • that the first data table is consistent with the second data table can be understood as follows: the number of row identifiers stored in the first data table and in the second data table is the same, and the content of the row data corresponding to the same row identifier is also the same.
  • comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table may include:
  • when the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, it is determined that the first data table is consistent with the second data table;
  • when the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, it is determined that the first data table is inconsistent with the second data table.
  • the hash value of each node of the Merkle tree is obtained by performing a hash operation on its child nodes. For example, the hash value of the root node is determined according to the two intermediate nodes corresponding to the root node, and the hash value of a leaf node is determined according to the data in the hash bucket corresponding to the leaf node. Therefore, when the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, it can be considered that the first data table and the second data table are consistent.
  • when the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, it may be considered that there is a difference between the first data table and the second data table, that is, they are inconsistent.
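  • the root comparison can be sketched end to end as follows. This is a simplified illustration, not the method as claimed: Python dictionaries mapping numeric row identifiers to row data stand in for the two data tables, `h`, `merkle_root` and `tables_consistent` are hypothetical names, SHA-256 truncated to 64 bits is an assumed hash function, and the number of buckets is assumed to be a power of two.

```python
import hashlib


def h(data: bytes) -> int:
    """Assumed hash function: SHA-256 truncated to 64 bits."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def merkle_root(table, num_buckets=4):
    """Compute the root hash value for a table given as {row_id: row_data}."""
    # Leaf hash = XOR of the row-data hashes of all rows in the bucket.
    leaves = [0] * num_buckets
    for row_id, row_data in table.items():
        leaves[(row_id - 1) % num_buckets] ^= h(row_data.encode("utf-8"))
    # Each parent hash is computed from the hash values of its two children.
    level = leaves
    while len(level) > 1:
        level = [h(level[i].to_bytes(8, "big") + level[i + 1].to_bytes(8, "big"))
                 for i in range(0, len(level), 2)]
    return level[0]


def tables_consistent(first_table, second_table, num_buckets=4):
    """Two tables are treated as consistent when their root hash values match."""
    return merkle_root(first_table, num_buckets) == merkle_root(second_table, num_buckets)
```

  • in this sketch, a changed row value or a missing row in the second table changes a leaf hash value and therefore the root hash value, so a single comparison of the two roots is enough to detect the difference.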
  • the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1 ≤ i ≤ N;
  • the row data corresponding to the inconsistent row IDs are queried from the first database and the second database respectively according to the inconsistent row IDs.
  • determining that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node may proceed as follows: the hash value of the first leaf node with serial number 1 is compared with the hash value of the second leaf node with serial number 1; if the hash values are the same, the hash values of the leaf nodes with serial number 2 are compared next, and so on, until it is determined that the hash value of the first leaf node with serial number i is not the same as the hash value of the second leaf node with serial number i.
  • the first leaf node with serial number i can be understood as the ith first leaf node, and the second leaf node with serial number i can be understood as the ith second leaf node.
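  • the sequential leaf comparison described above can be sketched as follows. The function name `first_differing_leaf` is hypothetical, and the two lists of leaf hash values are assumed to have the same length N.

```python
def first_differing_leaf(first_leaves, second_leaves):
    """Return the serial number i (starting at 1) of the first pair of leaf
    hash values that differ, or None when all pairs are equal."""
    for serial, (first, second) in enumerate(zip(first_leaves, second_leaves), start=1):
        if first != second:
            return serial
    return None
```

  • the returned serial number i identifies the i-th first hash bucket and the i-th second hash bucket, which are then examined to find the inconsistent row identifiers.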
  • the hash value of the first leaf node with the serial number N and the hash value of the second leaf node with the serial number N may include:
  • the row identifier included in a hash group in the i-th first hash bucket is the same as the row identifier included in a hash group in the i-th second hash bucket, but the corresponding hash values are different;
  • the above row identifier is a row identifier that is inconsistent between the first data table and the second data table.
  • the process of locating inconsistent row data may be as follows: first, the row identifiers of the inconsistent rows are extracted; then the corresponding row data is looked up in the source database and in the target database by these row identifiers; finally, the retrieved row data, which contains the data of each row, is compared directly to find the inconsistent row data.
  • when the above data comparison module performs the data consistency comparison, the comparison proceeds top-down. After the specific inconsistent data is found, the data storage module stores the information in a non-volatile storage medium so that it can be queried when needed.
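  • the bucket comparison and the reverse lookup can be sketched as follows. This is an illustration under stated assumptions: buckets are lists of hash groups `(row_id, hash_value)`, dictionaries stand in for the source and target databases, and the names `inconsistent_row_ids` and `reverse_check` are hypothetical.

```python
def inconsistent_row_ids(first_bucket, second_bucket):
    """Return row identifiers whose hash values differ between the two buckets,
    or that appear in only one of the two buckets."""
    first = dict(first_bucket)
    second = dict(second_bucket)
    return sorted(row_id for row_id in first.keys() | second.keys()
                  if first.get(row_id) != second.get(row_id))


def reverse_check(row_ids, source_db, target_db):
    """Look up the row data for each inconsistent row identifier in both databases."""
    return {row_id: (source_db.get(row_id), target_db.get(row_id)) for row_id in row_ids}
```

  • a row that is missing from one side shows up with `None` on that side in the result of `reverse_check`, which distinguishes lost rows from rows whose values differ.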
  • the following describes the process of comparing the two Merkle trees for consistency according to the method provided in the foregoing step 130 .
  • the two Merkle trees shown in FIG. 7 can be obtained according to the above-mentioned steps 110 and 120.
  • these two Merkle trees are denoted as Merkle tree #1 (i.e., an example of the above-mentioned first Merkle tree) and Merkle tree #2 (i.e., an example of the above-mentioned second Merkle tree).
  • each node is described by taking Merkle tree #1 as an example.
  • the hash value of each node of Merkle tree #2 is obtained by a similar method.
  • the four leaf nodes #1 of Merkle tree #1 (i.e., an example of the first leaf node) correspond one-to-one to four hash buckets #1 (i.e., an example of the first hash bucket).
  • the four leaf nodes #1 are respectively recorded as the first leaf node #1, the second leaf node #1, the third leaf node #1, and the fourth leaf node #1.
  • from left to right, these 4 hash buckets #1 are recorded as the first hash bucket #1, the second hash bucket #1, the third hash bucket #1, and the fourth hash bucket #1.
  • Each hash bucket #1 includes 2 hash groups, and each hash group includes a row ID and a hash value.
  • the hash value of the first leaf node #1 is obtained by XORing the hash values included in the two hash groups in the first hash bucket #1, that is, the hash value of the first leaf node #1 is N0 = XOR(1, 5), where XOR(1, 5) means performing an XOR operation on the hash value corresponding to row ID 1 and the hash value corresponding to row ID 5, that is, XOR(1, 5) is equal to XOR(011, 101).
  • the hash value of intermediate node #1 is obtained by hashing the hash values of its two child nodes.
  • H(N0, N1) represents the result of hashing the hash values (i.e., N0 and N1) of the two leaf nodes #1 of this intermediate node #1.
  • the hash value of root node #1 is obtained by hashing the hash values of its two child nodes.
  • H(N4, N5) represents the hash value of root node #1 of Merkle tree #1.
  • the process of performing consistency check on Merkle tree #1 and Merkle tree #2 may be: first, compare whether the hash values of the root nodes are the same.
  • the i-th leaf node #1 above can be understood as the leaf node #1 with the serial number i
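The check order described above (root nodes first, leaf nodes only when the roots differ) might be sketched as follows. The function names, the list-based tree layout, string-valued leaf hashes, and SHA-256 are all hypothetical choices; the leaf count is assumed to be a power of two.

```python
import hashlib

def build_levels(leaf_hashes):
    """Return the tree levels from the leaves up to the root."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([
            hashlib.sha256((prev[i] + prev[i + 1]).encode()).hexdigest()
            for i in range(0, len(prev), 2)
        ])
    return levels

def differing_leaves(leaves_1, leaves_2):
    """Compare roots first; on mismatch, return the serial numbers i
    (1-based) of the leaf nodes whose hash values differ."""
    root_1 = build_levels(leaves_1)[-1][0]
    root_2 = build_levels(leaves_2)[-1][0]
    if root_1 == root_2:
        return []  # tables are considered consistent
    return [i + 1 for i, (a, b) in enumerate(zip(leaves_1, leaves_2)) if a != b]

print(differing_leaves(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # [3]
```

When the roots match, no leaf is inspected at all, which is where the saving over a row-by-row comparison comes from.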
  • FIG. 7 is only for illustration and does not constitute any limitation to the present application.
  • each of hash bucket #1 and hash bucket #2 shown in FIG. 7 may include a different number of hash groups.
  • the hash values included in hash bucket #1 and hash bucket #2 shown in FIG. 7 may also be hash values of a larger magnitude.
  • the types of the first database and the second database involved in the above steps 110 to 130 are not specifically limited.
  • the first database and the second database may be heterogeneous databases.
  • the first database and the second database may be homogeneous databases.
  • the first database may be a relational database or a non-relational database
  • the second database may be a relational database or a non-relational database
  • the first database may be a relational database
  • the second database may be a non-relational database
  • the first database may be a relational database
  • the second database may be a relational database
  • Relational databases usually store data in tabular form, so data can be extracted directly from a relational database to construct the first data table. Non-relational databases usually store data in non-tabular form (for example, documents, key-value pairs, or graph structures). Therefore, before data is extracted from a non-relational database to construct the first data table, the data to be verified in the non-relational database must first be converted into table form, wherein each row of the table may include a row ID and one or more items of row data.
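One possible way to bring document-style records into the row ID plus row data shape is sketched below. The `_id` field name, the sorted column order, and the use of `None` for missing fields are assumptions for illustration; the text does not prescribe a conversion.

```python
def documents_to_rows(documents, id_field="_id"):
    """Flatten documents into (row_id, row_data) rows with a fixed column order."""
    # A fixed, sorted column order keeps the row data comparable across databases.
    columns = sorted({key for doc in documents for key in doc if key != id_field})
    rows = []
    for doc in documents:
        row_data = tuple(doc.get(col) for col in columns)  # missing fields become None
        rows.append((doc[id_field], row_data))
    return columns, rows

docs = [
    {"_id": 1, "name": "alice", "city": "Paris"},
    {"_id": 2, "name": "bob"},
]
columns, rows = documents_to_rows(docs)
print(columns)  # ['city', 'name']
print(rows)     # [(1, ('Paris', 'alice')), (2, (None, 'bob'))]
```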
  • the data in the production database (that is, an example of the above-mentioned first database) changes dynamically.
  • the verified data changes again, which needs to be verified again.
  • the update of each leaf will cause the recalculation of the hash values of the layers above it.
  • the data verification method provided by this application can also verify the full data and the incremental data respectively. Specifically, a Merkle tree with a higher number of layers is used for verification on the full amount of data, and a Merkle tree with a lower number of layers is used for verification on the incremental data, thereby effectively reducing computational overhead.
  • the method for acquiring incremental data to be verified from the data table is not specifically limited.
  • an existing method for acquiring incremental data to be verified in a data table may be used.
  • other methods for acquiring incremental data to be verified may also be used.
  • data table #1 (ie, an example of the first data table described above) includes 10 rows of data
  • data table #2 (ie, an example of the second data table described above) includes 10 rows of data
  • data table #2 is obtained by replicating data table #1.
  • the methods from steps 110 to 130 above can be used to generate Merkle tree #1 (that is, an example of the first Merkle tree above) according to data table #1
  • and to generate Merkle tree #2 (that is, an example of the above-mentioned second Merkle tree) according to data table #2, and by comparing Merkle tree #1 and Merkle tree #2, it is determined whether data table #1 and data table #2 are consistent.
  • the data in rows 5 to 10 in data table #1 is updated, and at this time, the updated data table #1 can be recorded as data table #3 (that is, an example of the above-mentioned first data table), and data table #4 (that is, an example of the above-mentioned second data table) is obtained by replicating data table #3.
  • the methods of steps 110 to 130 above may be used to perform consistency check only on incremental data.
  • Merkle tree #3 (that is, an example of the above-mentioned first Merkle tree) is generated according to data table #3
  • Merkle tree #4 (that is, an example of the above-mentioned second Merkle tree) is generated according to data table #4, and Merkle tree #3 and Merkle tree #4 are compared to determine whether data table #3 and data table #4 are consistent.
  • the data verification method provided by the present application can better meet the requirements for accurate and fast verification of massive data in different scenarios (eg, online data or offline data).
  • the hash value of each first leaf node of the first Merkle tree corresponding to the first data table is determined according to the corresponding first hash bucket
  • the second data table corresponding to the second The hash value of each second leaf node of the Merkle tree is determined according to its corresponding second hash bucket.
  • the hashing rule for obtaining the N second hash buckets by hash partitioning the second data table according to row identifiers is the same as the hashing rule for obtaining the N first hash buckets by hash partitioning the first data table according to row identifiers. This ensures that the data corresponding to the same row identifier in the second data table and in the first data table is mapped to the second hash bucket and the first hash bucket with the same sequence number, so that the generated first Merkle tree has the same structure as the second Merkle tree. Therefore, whether the first data table and the second data table are consistent can be determined directly by comparing the hash values of the root nodes of the two Merkle trees.
  • the data verification method provided by the present application can further determine the inconsistent row identifiers by comparing the hash values of the above two Merkle tree root nodes and the corresponding row data.
  • the data verification method provided by the present application can also adaptively adjust the tree heights of the first Merkle tree and the second Merkle tree according to the size of the data set to be verified.
  • the method 200 includes steps 210 to 290, and the steps 210 to 290 are described below.
  • the execution subject of steps 210 to 290 may be the data verification apparatus 130 shown in FIG. 2.
  • Step 210 start.
  • step 210 indicates that the data consistency check is started.
  • step 220 check mark bits are added to the data to be replicated in database #1 (ie, an example of the first database in the above method 100).
  • the method of inserting a check mark into the data to be copied may be the same as the existing method, and details are not described herein again.
  • Step 230 Database #2 (ie, an example of the second database in the above-mentioned method 100) replicates the above-mentioned data to be replicated.
  • Step 240 database #1 detects the flag bit and acquires data table #1.
  • the method of acquiring the data table #1 after detecting the flag bit from the database #1 can be the same as the existing method, and details are not described here.
  • Step 250 database #2 detects the flag bit, and acquires data table #2.
  • Step 260 according to the data table #1, generate a Merkle tree #1 (that is, an example of the first Merkle tree in the above method 100).
  • Step 270 according to data table #2, generate Merkle tree #2 (ie, an example of the second Merkle tree in the above method 100)
  • the method for determining the Merkle tree in the above steps 260 and 270 is the same as the method for determining the Merkle tree in the method 100. For details, refer to the above step 110, which will not be described in detail here.
  • Step 280 Determine whether the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2.
  • the above determining whether the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2 includes:
  • if the two hash values are the same, step 281 is performed;
  • if the two hash values are different, step 282 and step 283 are performed.
  • Step 281 it is determined that the data table #1 and the data table #2 are consistent.
  • Step 282 Compare the hash value of the leaf node of Merkle tree #1 with the hash value of the leaf node of Merkle tree #2, and determine inconsistent row identifiers.
  • step 281 and step 282 is not specifically limited.
  • step 281 may be performed first and then step 282 may be performed.
  • step 282 may be performed first and then step 281 may be performed.
  • Step 283 Determine inconsistent row data from database #1 and database #2 according to the above determined inconsistent row identifiers.
  • the determined inconsistent row identifiers and corresponding row data may also be stored in a storage module of the data verification apparatus, for example, the storage module 136 shown in FIG. 2 .
  • Step 290 end.
  • the above step 290 indicates ending the data consistency check.
  • FIG. 8 is only for illustration and does not impose any limitation on the data verification process provided by the present application.
  • steps 282 and 283 may not be performed after it is determined that data table #1 and data table #2 are inconsistent.
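Steps 282 and 283 narrow a mismatch down from a leaf node to individual rows. A minimal sketch, assuming each hash bucket is available as a mapping from row ID to row-data hash and using a dictionary lookup as a stand-in for the database query, could look like this (all names are hypothetical):

```python
def inconsistent_row_ids(bucket_1, bucket_2):
    """Row IDs whose row-data hash differs, or that exist in only one bucket."""
    ids = set(bucket_1) | set(bucket_2)
    return sorted(rid for rid in ids if bucket_1.get(rid) != bucket_2.get(rid))

def fetch_rows(table, row_ids):
    """Stand-in for querying a database by row ID (here: a dict lookup)."""
    return {rid: table.get(rid) for rid in row_ids}

# Hash groups of one pair of buckets whose leaf hashes were found to differ.
bucket_1 = {1: "h1", 5: "h5"}
bucket_2 = {1: "h1", 5: "h5-changed"}

table_1 = {1: ("a",), 5: ("b",)}
table_2 = {1: ("a",), 5: ("B",)}

bad_ids = inconsistent_row_ids(bucket_1, bucket_2)
print(bad_ids)                       # [5]
print(fetch_rows(table_1, bad_ids))  # {5: ('b',)}
print(fetch_rows(table_2, bad_ids))  # {5: ('B',)}
```

Only the buckets behind differing leaf nodes need this row-level comparison; all other buckets are skipped.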
  • FIG. 9 is a schematic diagram of a Merkle tree determined according to the method provided in the present application.
  • Merkle tree #3 (that is, an example of the first Merkle tree above)
  • Merkle tree #4 (that is, another example of the first Merkle tree described above)
  • Merkle tree #5 (that is, an example of the above-mentioned second Merkle tree)
  • Merkle tree #6 (that is, another example of the above-mentioned second Merkle tree)
  • Merkle tree #3 and Merkle tree #5 have a height of 3
  • Merkle tree #4 and Merkle tree #6 have a height of 2.
  • For details of Merkle tree #3, Merkle tree #4, Merkle tree #5, and Merkle tree #6, refer to the content described for FIG. 7 above, which is not repeated here.
  • Merkle tree #3 can be understood as a Merkle tree generated at time #1 based on the full amount of data in data table #1 (that is, an example of the first data table above).
  • Merkle tree #4 can be understood as the Merkle tree generated according to the incremental data in data table #1 at time #2, where time #2 is a time after time #1, and at time #2 the incremental data in data table #1 includes the row data corresponding to the following row identifiers in data table #1: "2", "6", "7", and "8".
  • Merkle tree #5 can be understood as a Merkle tree generated at time #1 from the full amount of data in data table #2 (that is, an example of the second data table above).
  • Merkle tree #6 can be understood as a Merkle tree generated according to the incremental data in data table #2 at time #2, and the incremental data in data table #2 at time #2 includes the row data corresponding to the following row identifiers in data table #2: "2", "6", "7", and "8".
  • the data table #2 is a data table obtained after migrating or synchronizing the data table #1.
  • an independent Merkle treelet is used to perform data verification on incremental data, which can further save computational overhead and improve the efficiency of data consistency verification.
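To make the height difference concrete, the sketch below picks a leaf count from the number of rows to be verified and derives the tree height from it. The sizing rule (two rows per bucket, leaf count rounded up to a power of two, at least two leaves) and the 8-row full table are assumptions chosen so that the heights match the 3 and 2 mentioned for FIG. 9; the text only states that the tree height can be adapted to the size of the data set.

```python
import math

def leaf_count(num_rows, rows_per_bucket=2):
    """Smallest power-of-two leaf count (at least 2) that holds num_rows
    at rows_per_bucket rows per leaf. The sizing rule is hypothetical."""
    needed = max(2, math.ceil(num_rows / rows_per_bucket))
    return 1 << (needed - 1).bit_length()

def tree_height(num_leaves):
    """Levels in a full binary tree with a power-of-two number of leaves."""
    return num_leaves.bit_length()

full_rows = 8                              # assumed size of the full data
incremental_ids = ["2", "6", "7", "8"]     # incremental rows named in the text

print(tree_height(leaf_count(full_rows)))             # 3
print(tree_height(leaf_count(len(incremental_ids))))  # 2
```

A smaller tree for the incremental rows means fewer hash computations than rebuilding the full tree after every change.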
  • the data verification device should include a processing unit and a determination unit.
  • the data verification device may be the data verification device 130 above.
  • the data verification apparatus may further include a transceiver unit.
  • the data verification device includes a processing unit and a determination unit as an example for introduction.
  • FIG. 10 is a schematic structural diagram of a data verification apparatus 1000 provided by the present application.
  • the apparatus 1000 includes: a processing unit 1001 and a determination unit 1002 .
  • the processing unit 1001 is configured to process the first data table in the first database to generate a first Merkle tree, where each row of the first data table includes a row identifier and row data, the first Merkle tree includes N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with the N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
  • the processing unit 1001 is further configured to process a second data table in the second database to generate a second Merkle tree, where the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with the N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to the row identifier, the hash rule for obtaining the N second hash buckets by hash partitioning the second data table according to the row identifier is the same as the hash rule for obtaining the N first hash buckets by hash partitioning the first data table according to the row identifier, and any two second hash buckets are different;
  • the determining unit 1002 is configured to compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
  • the first data table includes M rows, where M is a positive integer greater than or equal to 1,
  • the processing unit 1001 is also used for:
  • Hash the M rows to obtain M first hash groups, where the M first hash groups are in one-to-one correspondence with the M rows, each of the first hash groups includes one row identifier in the M rows and the hash value of the row data corresponding to the one row identifier, and the row identifiers included in each of the first hash groups are different;
  • the determining unit 1002 is also used for:
  • the processing unit 1001 is also used for:
  • the first Merkle tree is generated according to the hash values of the N first leaf nodes.
  • the hash value of each first leaf node is a hash value obtained by performing an XOR operation on the hash values included in the first hash group included in the corresponding first hash bucket.
  • the determining unit 1002 is further configured to:
  • if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, determine that the first data table is inconsistent with the second data table.
  • the determining unit 1002 is further configured to:
  • determine that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, where the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1 ≤ i ≤ N;
  • the processing unit 1001 is also used for:
  • the row data corresponding to the inconsistent row IDs are queried from the first database and the second database respectively according to the inconsistent row IDs.
  • the first data table includes the full amount of data in at least one data table in the first database.
  • the first data table includes incremental data in at least one data table in the first database.
  • the height of the first Merkle tree is associated with the first data table.
  • the first database and the second database are heterogeneous databases or homogeneous databases.
  • the first database is a relational database or a non-relational database
  • the second database is a relational database or a non-relational database.
  • the data verification device including a transceiver, a processor and a memory is used as an example for introduction.
  • FIG. 11 is a schematic structural diagram of a data verification device 1000 provided by the present application.
  • the device 1000 includes: a transceiver 1010 , a processor 1020 and a memory 1030 .
  • the transceiver 1010 , the processor 1020 and the memory 1030 communicate with each other through an internal connection path to transmit control and/or data signals.
  • the memory 1030 is used to store computer programs, and the processor 1020 is used to call and run the computer programs from the memory 1030 to control the transceiver 1010 to send and receive signals.
  • the transceiver 1010 can be used to obtain the above-mentioned first data table and second data table, which will not be repeated here.
  • the functions of the processor 1020 correspond to the specific functions of the processing unit 1001 and the determination unit 1002 shown in FIG. 10 , and details are not repeated here.
  • the data verification device should include a processor.
  • the data verification device may be any one of the terminal devices described above.
  • the data verification device may further include a transceiver.
  • the data verification device may further include a memory.
  • FIG. 12 is a schematic structural diagram of a system 1200 provided by the present application.
  • the system 1200 includes: the data verification apparatus 1000 or the data verification device 1100 mentioned above.
  • the system 1200 may further include the above-mentioned first database and the above-mentioned second database.
  • Embodiments of the present application provide a computer program product, which when the computer program product runs on the data verification apparatus 1310, enables the data verification apparatus 1310 to execute the method 100 and/or the method 200 in the above method embodiments.
  • the disclosed systems, devices and methods may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the unit is only a logical function division.
  • there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the unit described as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, they can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer program instructions.
  • the computer program instructions When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program instructions may be transmitted from a website, computer, server, or data center via wired or wireless transmission to another website, computer, server, or data center.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
  • the available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., digital video discs (DVDs)), or semiconductor media (e.g., solid state drives), and the like.
  • the term "and/or" in this application only describes an association relationship between associated objects and indicates that three relationships are possible; for example, A and/or B can mean three cases: A exists alone, A and B exist at the same time, and B exists alone.
  • the character "/" in this document generally indicates an "or" relationship between the objects before and after it; the term "at least one" in this application can mean "one" and "two or more"; for example, at least one of A, B, and C can mean seven cases: A exists alone, B exists alone, C exists alone, A and B exist simultaneously, A and C exist simultaneously, C and B exist simultaneously, and A, B, and C exist simultaneously.


Abstract

A data verification method (100, 200), apparatus (130, 1000, 1210), and system (100, 1200). The method comprises: generating a first Merkle tree according to a first data table in a first database, and generating a second Merkle tree according to a second data table in a second database, wherein the structure and generation method of the first Merkle tree and the second Merkle tree are exactly the same. Therefore, whether the first data table and the second data table are consistent can be determined by directly comparing the hash values of the root nodes of the two Merkle trees, such that the demand for accurate and fast verification of massive data can be met.

Description

Data verification method, apparatus, and system
This application claims priority to Chinese patent application No. 202011040390.7, entitled "Data Verification Method, Apparatus and System" and filed with the China Patent Office on September 28, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of storage, and more particularly, to a data verification method, apparatus, and system.
Background
During data synchronization or data migration in a database system (for example, between heterogeneous databases or homogeneous databases), the data consistency of the synchronized tables in the source database and the target database needs to be checked in order to verify that the data synchronization or data migration is correct. In practical applications, the data in the source database is often inconsistent with the data in the target database. On the one hand, hardware failures, software defects, human errors, environmental interference, and other factors can cause data loss and data errors during data transmission and data storage, so that the data in the source database becomes inconsistent with the data in the target database. On the other hand, because of performance limitations of the database system, changes to a source data table may reach the target database only after a certain delay, so that at a given moment a check of the data in the source database against the data table in the target database reports an inconsistency.
Traditional data verification methods cannot meet the need for accurate and fast verification of massive data.
Summary of the Invention
The present application provides a data verification method, apparatus, and system that can be applied in data synchronization or data migration scenarios. The method can better meet the need for accurate and fast verification of massive data.
According to a first aspect, a data verification method is provided, characterized in that the method includes:
processing a first data table in a first database to generate a first Merkle tree, where each row of the first data table includes a row identifier and row data, the first Merkle tree includes N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifiers, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
processing a second data table in a second database to generate a second Merkle tree, where the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to the row identifiers, the hash rule for obtaining the N second hash buckets by hash partitioning the second data table according to the row identifiers is the same as the hash rule for obtaining the N first hash buckets by hash partitioning the first data table according to the row identifiers, and any two second hash buckets are different; and
comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
In the above technical solution, the hash value of each first leaf node of the first Merkle tree corresponding to the first data table is determined according to the corresponding first hash bucket, and the hash value of each second leaf node of the second Merkle tree corresponding to the second data table is determined according to the corresponding second hash bucket. Because the hash rule for obtaining the N second hash buckets by hash partitioning the second data table according to the row identifiers is the same as the hash rule for obtaining the N first hash buckets by hash partitioning the first data table according to the row identifiers, the data corresponding to the same row identifier in the second data table and in the first data table is mapped to the second hash bucket and the first hash bucket with the same sequence number, so that the generated first Merkle tree and second Merkle tree have the same structure. Whether the first data table and the second data table are consistent can therefore be determined directly by comparing the hash values of the root nodes of the two Merkle trees. Accordingly, the data verification method provided by the present application can meet the need for accurate and fast verification of massive data.
With reference to the first aspect, in some implementations of the first aspect, the first data table includes M rows, M being a positive integer greater than or equal to 1, and the processing of the first data table in the first database to generate the first Merkle tree includes:
hashing the first data table to obtain M first hash groups, where the M first hash groups are in one-to-one correspondence with the M rows, each first hash group includes one row identifier in the first data table and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the first hash groups are different from one another;
mapping the M first hash groups to the N first hash buckets;
determining the hash values of the N first leaf nodes according to the N first hash buckets; and
generating the first Merkle tree according to the hash values of the N first leaf nodes.
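The four steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: SHA-256 as the hash function, the "hash of the row ID modulo N" bucket rule, string row data, an all-zero leaf for an empty bucket, and a power-of-two N are all illustrative choices, and `build_merkle_root` is a hypothetical name.

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle_root(table, n_buckets):
    """table: dict row_id -> row_data (str); n_buckets: power of two >= 2."""
    # Step 1: one hash group (row ID, hash of row data) per row.
    hash_groups = [(row_id, sha(row_data.encode())) for row_id, row_data in table.items()]
    # Step 2: map each hash group to a bucket by hashing its row ID.
    buckets = [[] for _ in range(n_buckets)]
    for row_id, row_hash in hash_groups:
        index = int.from_bytes(sha(str(row_id).encode()), "big") % n_buckets
        buckets[index].append(row_hash)
    # Step 3: leaf hash = XOR of the hashes in the bucket (empty bucket -> zero bytes).
    leaves = []
    for bucket in buckets:
        acc = bytes(32)
        for h in bucket:
            acc = bytes(a ^ b for a, b in zip(acc, h))
        leaves.append(acc)
    # Step 4: hash pairs of children until a single root remains.
    level = leaves
    while len(level) > 1:
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

source = {1: "alice,30", 2: "bob,41", 3: "carol,27"}
target = {3: "carol,27", 2: "bob,41", 1: "alice,30"}   # same rows, different order
print(build_merkle_root(source, 4) == build_merkle_root(target, 4))  # True
```

Row order in the table does not influence the root, because bucket membership depends only on the row ID and the leaf value is an XOR.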
It should be understood that the method for processing the second data table in the second database to generate the second Merkle tree is the same as the method above. Specifically, when the second data table includes K rows, where K is a positive integer less than or equal to N, processing the second data table in the second database to generate the second Merkle tree includes:
performing hash processing on the second data table to obtain K second hash groups, where the K second hash groups are in one-to-one correspondence with the K rows, each second hash group includes one row identifier in the second data table and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the second hash groups are all different;
mapping the K second hash groups to the N second hash buckets;
determining the hash values of the N second leaf nodes according to the N second hash buckets; and
generating the second Merkle tree according to the hash values of the N second leaf nodes.
When K is equal to M, the second data table contains the same number of rows as the first data table; that is, no data was lost while the first data table was synchronized or migrated to the second database. When K is less than M, the second data table contains fewer rows than the first data table; that is, data was lost while the first data table was synchronized or migrated to the second database.
It should also be understood that the rule for mapping the K second hash groups to the N second hash buckets is the same as the rule for mapping the M first hash groups to the N first hash buckets.
In the above technical solution, a data partitioning algorithm maps the row data in a data table to hash buckets, and the hash buckets are in one-to-one correspondence with the leaf nodes of the Merkle tree. Because the rule for mapping the K second hash groups to the N second hash buckets is the same as the rule for mapping the M first hash groups to the N first hash buckets, the generated first Merkle tree and second Merkle tree have the same structure.
With reference to the first aspect, in some implementations of the first aspect, the hash value of each first leaf node is obtained by performing an XOR operation on the hash values included in the first hash groups in the corresponding first hash bucket.
It should be understood that the hash value of each second leaf node is obtained by performing an XOR operation on the hash values included in the second hash groups in the corresponding second hash bucket.
In the above technical solution, the hash value of a leaf node is obtained by XORing the data in its hash bucket. Because XOR is commutative and associative, the result does not depend on the order of the rows, so the consistency check of the first data table and the second data table does not need to sort the row data, which further improves the efficiency of data verification.
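The order independence of the XOR-based leaf hash can be seen in a small Python sketch. The helper names and the choice of SHA-256 for the per-row hashes are assumptions of this sketch, not details taken from the application.

```python
import hashlib
from functools import reduce

DIGEST_SIZE = 32  # length in bytes of a SHA-256 digest

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def leaf_hash(row_hashes):
    """XOR all row hashes in one bucket; an empty bucket yields all-zero bytes."""
    return reduce(xor_bytes, row_hashes, bytes(DIGEST_SIZE))

# Three hypothetical row hashes that landed in the same bucket.
hashes = [hashlib.sha256(s.encode()).digest() for s in ("row-a", "row-b", "row-c")]

forward = leaf_hash(hashes)             # rows visited in one order
backward = leaf_hash(reversed(hashes))  # same rows, opposite order
```

Here forward and backward are equal, so neither database has to return its rows in sorted order. One known property of XOR aggregation is that two identical hash values in the same bucket cancel each other out; whether and how the application guards against this is not stated in this passage.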
With reference to the first aspect, in some implementations of the first aspect, comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table includes:
determining the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;
if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determining that the first data table is consistent with the second data table; and
if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, determining that the first data table is inconsistent with the second data table.
In the above technical solution, because the structures of the first Merkle tree and the second Merkle tree (for example, the tree height and the row identifiers in the data table that correspond to each leaf node) are exactly the same, the consistency check can determine accurately and quickly whether the first data table is consistent with the second data table by checking whether the hash values of the two root nodes are the same. Specifically, when the hash value of the root node of the first Merkle tree is the same as that of the second Merkle tree, the first data table and the second data table can be determined to be consistent; when the two root hash values differ, the first data table and the second data table can be determined to be inconsistent.
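A minimal Python sketch of the root comparison is given below. It assumes that the number of leaf nodes is a power of two and that each parent node is the SHA-256 hash of the concatenation of its two children; both choices, and the function name merkle_root, are illustrative assumptions rather than details from the application.

```python
import hashlib

def merkle_root(leaf_hashes):
    """Fold the leaf hashes pairwise, level by level, into a single root hash.

    Assumes len(leaf_hashes) is a power of two (an assumption of this sketch).
    """
    level = list(leaf_hashes)
    while len(level) > 1:
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Hypothetical leaf hashes for the source and target tables (N = 4).
leaves_src = [hashlib.sha256(bytes([i])).digest() for i in range(4)]
leaves_dst = list(leaves_src)

# Identical leaves give identical roots, so the tables are judged consistent.
same = merkle_root(leaves_src) == merkle_root(leaves_dst)

# Changing a single leaf changes the root, so the tables are judged inconsistent.
leaves_dst[2] = hashlib.sha256(b"corrupted").digest()
different = merkle_root(leaves_src) != merkle_root(leaves_dst)
```

Only the two root hash values need to be exchanged and compared to reach the consistent or inconsistent verdict.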
With reference to the first aspect, in some implementations of the first aspect, after determining that the first data table is inconsistent with the second data table, the method further includes:
determining that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, where the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1≤i≤N;
comparing the i-th first hash bucket with the i-th second hash bucket to determine the row identifiers that are inconsistent between the first data table and the second data table; and
querying, according to the inconsistent row identifiers, the row data corresponding to the inconsistent row identifiers from the first database and the second database respectively.
In the above technical solution, after the first data table and the second data table are determined to be inconsistent, the inconsistent row identifiers in the first database and the second database can be determined from the hash values of the leaf nodes and the hash buckets corresponding to those leaf nodes. Further, the row data corresponding to the inconsistent row identifiers can be queried from the first database and the second database according to the determined row identifiers. Because the size of the data set under each leaf node is controllable, the time required to determine the inconsistent row identifiers is also controllable.
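The localization steps above might look like the following Python sketch. Buckets are modeled as dictionaries from row identifier to row hash, and leaf indices are zero-based here rather than the 1≤i≤N numbering used in the text; the function names and data layout are assumptions of this sketch.

```python
def differing_leaves(leaves_a, leaves_b):
    """Return the (zero-based) indices of leaf nodes whose hash values differ."""
    return [i for i, (a, b) in enumerate(zip(leaves_a, leaves_b)) if a != b]

def inconsistent_row_ids(bucket_a, bucket_b):
    """Compare two buckets (row_id -> row hash).

    A row identifier is inconsistent if it is missing on either side
    or if its row hash differs between the two buckets.
    """
    all_ids = set(bucket_a) | set(bucket_b)
    return sorted(r for r in all_ids if bucket_a.get(r) != bucket_b.get(r))

# Hypothetical contents of the i-th bucket on each side.
bucket_src = {1: b"h1", 2: b"h2", 5: b"h5"}
bucket_dst = {1: b"h1", 2: b"XX"}  # row 2 changed, row 5 missing

bad_ids = inconsistent_row_ids(bucket_src, bucket_dst)  # [2, 5]
```

The resulting identifiers can then be used to query the corresponding row data from each database; only the buckets under mismatching leaves need to be compared row by row.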
With reference to the first aspect, in some implementations of the first aspect, the first data table includes the full data in at least one data table in the first database.
With reference to the first aspect, in some implementations of the first aspect, the first data table includes the incremental data in at least one data table in the first database.
In the above technical solution, the full data and the incremental data can be separated and verified in two separate stages, which can reduce computational overhead. Specifically, because the full data is large, a Merkle tree with more levels can be constructed when verifying it; because the incremental data is small, a Merkle tree with fewer levels can be constructed when verifying it.
With reference to the first aspect, in some implementations of the first aspect, the height of the first Merkle tree is associated with the first data table.
In the above technical solution, the height of the first Merkle tree can be adaptively adjusted according to the size of the first data table to be verified.
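The application does not state how the height is derived from the table size. Purely as a hypothetical heuristic, one could pick the smallest power-of-two number of leaf nodes that keeps the expected number of rows per hash bucket under a chosen target, as in the Python sketch below; the function name tree_shape and the default target of 1000 rows per bucket are invented for illustration.

```python
def tree_shape(row_count, target_rows_per_bucket=1000):
    """Hypothetical heuristic: return (height, leaf_count) for a given table size.

    height is the number of edges from the root to a leaf, and
    leaf_count = 2 ** height is the number of hash buckets.
    """
    # Ceiling division: how many buckets are needed to stay under the target.
    buckets_needed = max(1, -(-row_count // target_rows_per_bucket))
    # Smallest height whose leaf count is at least buckets_needed.
    height = (buckets_needed - 1).bit_length()
    return height, 2 ** height

full_shape = tree_shape(1_000_000)  # a large full-data table
incr_shape = tree_shape(5_000)      # a small batch of incremental data
```

Under this assumed rule, a full-data check over one million rows gets a taller tree than an incremental check over five thousand rows, which matches the qualitative behavior described above.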
With reference to the first aspect, in some implementations of the first aspect, the first database and the second database are heterogeneous databases or homogeneous databases.
In the above technical solution, the data verification method provided in this application is applicable both to data consistency verification between homogeneous databases and to data consistency verification between heterogeneous databases.
With reference to the first aspect, in some implementations of the first aspect, the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.
In a second aspect, a data verification apparatus is provided. The data verification apparatus performs the method in the first aspect or any possible implementation of the first aspect.
It should be understood that the data verification apparatus provided in this application is decoupled from and independent of the database system, so it does not intrude on the database system, for example by affecting its functions or performance or by occupying its resources.
In a third aspect, a data verification device is provided. The device includes a memory and a processor. The memory is configured to store instructions, and the processor is configured to read the instructions stored in the memory, so that the data verification device performs the method in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, a processor is provided, including an input circuit, an output circuit, and a processing circuit. The processing circuit is configured to receive a signal through the input circuit and transmit a signal through the output circuit, so that the method in the first aspect or any possible implementation of the first aspect is implemented.
In a specific implementation, the processor may be a chip, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be transistors, gate circuits, flip-flops, and various logic circuits. The input signal received by the input circuit may be received and input by, for example but not limited to, a receiver, and the signal output by the output circuit may be output to and transmitted by, for example but not limited to, a transmitter. The input circuit and the output circuit may be the same circuit, serving as the input circuit and the output circuit at different times. The embodiments of this application do not limit the specific implementation of the processor or the circuits.
In a fifth aspect, a processing apparatus is provided, including a processor and a memory. The processor is configured to read instructions stored in the memory, and can receive signals through a receiver and transmit signals through a transmitter, to perform the method in the first aspect or any possible implementation of the first aspect.
Optionally, there are one or more processors and one or more memories.
Optionally, the memory may be integrated with the processor, or the memory may be provided separately from the processor.
In a specific implementation, the memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated with the processor on the same chip or provided on a different chip. The embodiments of this application do not limit the type of the memory or the manner in which the memory and the processor are arranged.
It should be understood that a related data interaction process, such as sending indication information, may be a process of outputting the indication information from the processor, and receiving capability information may be a process of the processor receiving the input capability information. Specifically, data output by the processor may be output to a transmitter, and input data received by the processor may come from a receiver. The transmitter and the receiver may be collectively referred to as a transceiver.
In a sixth aspect, a computer-readable storage medium is provided for storing a computer program. The computer program includes instructions for performing the method in the first aspect or any possible implementation of the first aspect.
In a seventh aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer performs the method in the first aspect or any possible implementation of the first aspect.
In an eighth aspect, a system is provided. The system includes the data verification apparatus described in the second aspect.
In a ninth aspect, a chip is provided, including at least one processor and an interface. The at least one processor is configured to invoke and run a computer program, so that the chip performs the method in the first aspect or any possible implementation of the first aspect.
Description of drawings
FIG. 1 is a schematic diagram of a system 100 to which the data verification method provided in this application is applicable.
FIG. 2 is a schematic diagram of the data verification apparatus 130 provided in this application.
FIG. 3 is a schematic flowchart of the data verification method 100 provided in this application.
FIG. 4 is a schematic diagram of a Merkle tree determined according to the method provided in this application.
FIG. 5 is a schematic diagram of extracting data from the first data table as provided in this application.
FIG. 6 is a schematic diagram of hash partitioning of the data extracted from the first data table as provided in this application.
FIG. 7 is a schematic diagram of a Merkle tree determined according to the method provided in this application.
FIG. 8 is a schematic flowchart of the data verification method 200 provided in this application.
FIG. 9 is a schematic diagram of a Merkle tree determined according to the method provided in this application.
FIG. 10 is a schematic structural diagram of a data verification apparatus 1000 provided in this application.
FIG. 11 is a schematic structural diagram of a data verification device 1000 provided in this application.
FIG. 12 is a schematic structural diagram of a system 1200 provided in this application.
Detailed description
The technical solutions in this application are described below with reference to the accompanying drawings.
The terms used in the embodiments section of this application are intended only to explain specific embodiments of this application and are not intended to limit this application.
The terms "first", "second", "third", and the like in this application are used to distinguish between identical or similar items whose roles and functions are substantially the same. It should be understood that there is no logical or temporal dependency among "first", "second", and "third", and that they do not limit quantity or execution order.
This application presents various aspects, embodiments, or features in terms of a system that may include a plurality of devices, components, modules, and the like. It should be understood that each system may include additional devices, components, modules, and the like, and/or may not include all of the devices, components, modules, and the like discussed with reference to the drawings. Combinations of these solutions may also be used.
In addition, in the embodiments of this application, words such as "exemplary" and "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as an "example" in this application should not be construed as preferred or advantageous over other embodiments or designs. Rather, the word "example" is used to present a concept in a concrete manner.
In the embodiments of this application, the terms "相应的" (corresponding, relevant) and "对应的" (corresponding) may sometimes be used interchangeably. It should be noted that, unless the difference is emphasized, the intended meaning is the same.
In the embodiments of this application, a subscripted symbol such as W₁ may occasionally be written in non-subscript form, such as W1. Unless the difference is emphasized, the intended meaning is the same.
The network architecture and service scenarios described in the embodiments of this application are intended to explain the technical solutions of the embodiments more clearly and do not limit them. A person of ordinary skill in the art will appreciate that, as network architectures evolve and new service scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
References in this specification to "one embodiment", "some embodiments", and the like mean that a particular feature, structure, or characteristic described in connection with that embodiment is included in one or more embodiments of this application. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
In this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may indicate that A exists alone, that both A and B exist, or that B exists alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of a single item or a plurality of items. For example, at least one of a, b, or c may represent a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple.
The technologies related to this application are introduced below.
For ease of understanding, before the data verification method provided in this application is described, the relevant terms used in this application are briefly introduced.
1. Data verification (data verify)
Data verification is a verification operation performed to ensure data integrity. Typically, a check value is computed over the original data with a specified algorithm, and the receiver computes a check value with the same algorithm. If the two check values are the same, the data is consistent.
2. Data replication
Data replication is a technique for copying data from one location to another. It involves sharing information to ensure consistency between redundant resources (such as software or hardware components), thereby improving reliability, fault tolerance, or accessibility.
3. Merkle tree
A Merkle tree is also called a hash tree. It is a binary tree consisting of a root node, a set of intermediate nodes, and a set of leaf nodes. The leaf nodes at the bottom contain the stored data or its hash value, each intermediate node is the hash of the contents of its two child nodes, and the root node is likewise formed from the hash of the contents of its two child nodes. When the data extraction module produces new data, the corresponding Merkle trees are updated dynamically.
4. Hash
A hash is a function that maps data of arbitrary length to data of fixed length. A slight change in the input produces a completely different hash result, and recovering the characteristics of the original input from the hash value is generally considered infeasible.
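The sensitivity of a hash to small input changes can be observed with Python's standard hashlib module. SHA-256 and the sample row strings are choices made only for this illustration; the application does not name a specific hash function here.

```python
import hashlib

# Two inputs that differ in a single character.
a = hashlib.sha256(b"row 42: balance=100").hexdigest()
b = hashlib.sha256(b"row 42: balance=101").hexdigest()

# Both digests have the same fixed length, regardless of input length.
fixed_length = len(a) == len(b) == 64

# Count how many of the 64 hex characters differ between the two digests.
differing = sum(x != y for x, y in zip(a, b))
```

Printing a and b side by side shows two digests of equal length that differ in many positions, which is why comparing hash values is a practical stand-in for comparing the row data itself.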
5. Heterogeneous database (HDB)
A heterogeneous database is a collection of multiple related database systems that enables data sharing and transparent access. Each database system already exists before it joins the heterogeneous database system and has its own database management system (DBMS). Each component of a heterogeneous database is autonomous: while data is shared, each database system retains its own application characteristics, integrity control, and security control.
6. Homogeneous database
A homogeneous database is one in which all sites use the same DBMS software, and the sites are aware of one another and cooperate to process user requests.
7. Relational database (RD)
A relational database is a database that organizes data using the relational model. It stores data in rows and columns so that users can understand it easily. A series of such rows and columns is called a table, and a set of tables makes up a database. Users retrieve data from the database through queries, where a query is executable code that delimits certain regions of the database. The relational model can be understood simply as a two-dimensional table model, and a relational database is a data organization made up of two-dimensional tables and the relationships between them.
8. Non-relational database (Not Only SQL, NoSQL)
A NoSQL database is a database that organizes data using a non-relational model. Non-relational databases include the following types: key-value stores (for example, Oracle BDB), column stores (for example, HBase), document databases (for example, CouchDB or MongoDB), and graph databases.
When data is synchronized between a source database and a target database, the data to be synchronized usually has to be transmitted over a physical medium or a network, which introduces a delay between the source database and the target database. In addition, many uncertain factors, such as hardware failures, software defects, human error, and environmental interference, may affect the reliability of the data during transmission (for example, causing data loss or data errors). Any of these causes may lead to inconsistency between the source database and the target database. Therefore, after data synchronization completes, a consistency check needs to be performed on the synchronized data stored in the target database to ensure its reliability. At present, an offline data verification method is usually used: the data generated by running application software is fetched from the production databases (that is, the source databases) of the independent data sources into a unified offline database (that is, the target database), and the offline data is then used to check whether the data of the independent data sources is consistent.
However, fetching data from a production database consumes database resources. To reduce the impact on database performance, the frequency of fetching data must be lowered, or fetching must even be restricted to periods of low business volume, which in turn affects the timeliness of offline data verification. In addition, the above data verification process usually requires the data to be sorted, and sorting often consumes a large amount of system resources. Therefore, the above offline data verification method usually cannot meet business requirements when verifying massive data (for example, terabyte-scale data).
This application provides a data verification method, apparatus, and system that can better meet the need for accurate and fast verification of massive data.
For ease of understanding, the system and the data verification apparatus to which the data verification method provided in this application is applicable are first described in detail with reference to FIG. 1 and FIG. 2.
FIG. 1 is a schematic diagram of a system 100 to which the data verification method provided in this application is applicable.
As shown in FIG. 1, the system 100 can be used in, but is not limited to, the following scenarios: a database data migration scenario or a data synchronization scenario. The system 100 may include at least one source database 110, at least one target database 120, and at least one data verification apparatus 130. The data verification apparatus 130 is a system on a third-party hardware device that is independent of the source database 110 and the target database 120. The source database 110 is the database before data migration or replication, and the target database 120 is the database after data migration or replication.
In this application, the type of the source database 110 and the type of the target database 120 are not specifically limited.
In one example, the source database 110 or the target database 120 may be a relational database. For example, the source database 110 or the target database 120 may be any one of the following relational databases: Oracle, DB2, Microsoft SQL Server, Microsoft Access, or MySQL. It should be understood that the relational database types listed here are only examples and do not limit the system 100 in any way. For example, the relational database may also be a type of relational database other than those listed above.
In another example, the source database 110 or the target database 120 may be a non-relational database. For example, the source database 110 or the target database 120 may be any one of the following non-relational databases: NoSQL, Cloudant, or MongoDB. It should be understood that the non-relational database types listed here are only examples and do not limit the system 100 in any way. For example, the non-relational database may also be a type of non-relational database other than those listed above.
In yet another example, the source database 110 may be a non-relational database, and the target database 120 may be a relational database. For example, the source database 110 may be a NoSQL database, and the target database 120 may be an Oracle database.
In this application, the source database 110 and the target database 120 may be homogeneous databases or heterogeneous databases, which is not limited here.
In this application, how the source database 110, the target database 120, and the data verification apparatus 130 are deployed on devices is not specifically limited, provided that the data verification apparatus 130 is a system independent of the source database 110 and the target database 120.
In one example, the source database 110 may be a physical module or a virtual module deployed on physical device #1, the target database 120 may be a physical module or a virtual module deployed on physical device #2, and the data verification apparatus 130 may be a physical module or a virtual module deployed on physical device #3, where physical device #1, physical device #2, and physical device #3 are different devices.
In another example, the source database 110 and the target database 120 may be different physical modules or virtual modules deployed on physical device #1, and the data verification apparatus 130 may be a physical module or a virtual module deployed on physical device #2, where physical device #1 and physical device #2 are different devices.
Referring to FIG. 1, the source database 110 and the target database 120 may interact with each other (for example, for data migration or data synchronization), and each of them may also interact with the data verification apparatus 130. After the data in the source database 110 has been synchronized or migrated to the target database 120, the data verification apparatus 130 may extract the source-side data to be verified from the source database 110 and the target-side data to be verified from the target database 120, and perform a consistency check on the two extracted sets of data to verify whether the migrated or synchronized data is consistent between the source database 110 and the target database 120. When the data verification apparatus 130 determines that the data to be verified extracted from the source database is inconsistent with that extracted from the target database, it may further determine exactly which data is inconsistent. When the data verification apparatus 130 has a storage function, the result of the consistency check may also be stored in the data verification apparatus 130.
It should be understood that FIG. 1 is merely illustrative and does not limit the systems to which this application applies. For example, the system 100 may include a larger number of source databases 110, target databases 120, and/or data verification apparatuses 130. As another example, the data verification apparatus 130 may further include other modules, such as a verification execution module, a source-side to-be-verified data management module, and a target-side to-be-verified data management module.
The following describes, with reference to FIG. 2, a schematic structure of the data verification apparatus 130 in FIG. 1 provided in this application.
FIG. 2 is a schematic diagram of the data verification apparatus 130 provided in this application.
As shown in FIG. 2, the apparatus 130 may include a source-side data extraction module 131, a source-side processing module 132, a target-side processing module 133, a target-side data extraction module 134, a comparison module 135, and a storage module 136. These modules may be connected through internal connection paths. For example, the source-side processing module 132 may interact with the comparison module 135, the source-side data extraction module 131, and the target-side processing module 133.
The source-side data extraction module 131 is configured to obtain data from a source database, for example, the source database 110 in FIG. 1.
The source-side processing module 132 is configured to obtain data from the source-side data extraction module 131, and to perform processing such as hashing and data partitioning on the obtained data.
The target-side processing module 133 is configured to obtain data from the target-side data extraction module 134, and to perform processing such as hashing and data partitioning on the obtained data.
The target-side data extraction module 134 is configured to obtain data from a target database, for example, the target database 120 in FIG. 1.
The comparison module 135 is configured to obtain, from the source-side processing module 132 and the target-side processing module 133, the Merkle trees corresponding to the data, and to verify data consistency based on the obtained Merkle trees. The comparison module 135 is the core module of the data verification apparatus 130. Specifically, the comparison module 135 may further include a data comparison submodule and a data reverse-lookup submodule. The data comparison submodule may quickly compare the Merkle trees to find the set of identifiers of inconsistent rows. The data reverse-lookup submodule may then use those row identifiers to look up the detailed data in the databases, and finally find the data values corresponding to the inconsistent row identifiers.
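The quick comparison performed by the data comparison submodule can be pictured as a top-down walk over two Merkle trees of the same shape. The following is only a minimal sketch under stated assumptions: the level-list representation and the function name `differing_leaves` are hypothetical and are not taken from this application.

```python
def differing_leaves(levels_a, levels_b):
    """Return indices of leaves whose hashes differ between two trees.

    Assumption: each tree is a list of levels, levels[0] == [root_hash]
    and levels[-1] holds the leaf hashes; both trees have the same shape.
    """
    if levels_a[0] == levels_b[0]:
        return []  # identical roots: no leaf needs to be inspected
    suspects = [0]  # indices of mismatching nodes on the current level
    for depth in range(1, len(levels_a)):
        next_suspects = []
        for parent in suspects:
            for child in (2 * parent, 2 * parent + 1):
                if (child < len(levels_a[depth])
                        and levels_a[depth][child] != levels_b[depth][child]):
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects
```

Only subtrees whose hashes differ are descended into, so when few buckets are inconsistent most of the tree is never visited. The rows in the reported buckets would then be passed to the reverse-lookup step.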
The storage module 136 is configured to store data and instructions.
It should be understood that FIG. 2 is merely illustrative and does not limit the data verification apparatus 130 provided in this application. For example, the source-side processing module 132 and the target-side processing module 133 may be included in a single processing module. As another example, the source-side data extraction module 131 and the target-side data extraction module 134 may be included in a single module. As yet another example, when the comparison module 135 provides the function of the storage module 136, the data verification apparatus 130 may omit the storage module 136.
The data verification method provided in this application is described in detail below with reference to FIG. 3 to FIG. 8.
FIG. 3 is a schematic flowchart of the data verification method 100 provided in this application.
As shown in FIG. 3, the method 100 may include steps 110 to 130, which are described in detail below. Steps 110 to 130 may be performed by the data verification apparatus 130 shown in FIG. 2.
Step 110: Process a first data table in a first database to generate a first Merkle tree.
The first database can be understood as the production database, that is, the source database. For example, the first database may be the source database 110 shown in FIG. 1.
In this application, the source and the size of the data included in the first data table are not limited.
In one example, the first data table may include the full data of at least one data table in the first database.
Optionally, the first data table may include two or more data tables in the first database. In this case, the first data table can be understood as a data set composed of two or more data tables in the first database.
For example, when the first database includes data table #1, data table #2, and data table #3, the first data table may include all the data in data table #1, all the data in data table #1 and data table #3, or all the data in data table #1, data table #2, and data table #3.
In another example, the first data table may include incremental data of at least one data table in the first database. That is, the data verification method provided in this application may also perform the consistency check only on the data that has changed in a data table.
For example, at time #1, data table #1 and data table #2 are consistent, where data table #2 was obtained by replicating data table #1. After time #1, part of the data in data table #1 changes (for example, data is updated, added, or deleted). In this case, the changed data of data table #1 can be regarded as the data included in the first data table.
In this application, each row of the first data table may include a row identifier and row data. The first Merkle tree includes N first leaf nodes, which are in one-to-one correspondence with N first hash buckets. The hash value of each first leaf node is determined from the corresponding first hash bucket. The N first hash buckets are obtained by hash partitioning the first data table by row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2.
In this application, the type of the row identifiers included in the first data table is not specifically limited.
In one example, the row identifier may be a numeric row identifier, for example, "5".
In another example, the row identifier may be a string-type row identifier, for example, "张三" or "李四".
When the row identifier is a string-type row identifier, it must be processed before the Merkle tree is built to obtain the hash value corresponding to that string-type row identifier.
In this application, the N first hash buckets are obtained by hash partitioning the first data table by row identifier. This can be understood as follows: regardless of whether the row identifiers of the first data table are numeric or string-type, when the data in the first data table is mapped to hash buckets, a hash operation may first be performed on each row identifier to obtain its hash value, a modulo operation is then performed on that hash value, and the hash bucket for the row containing the row identifier is determined from the result of the modulo operation.
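The hash-then-modulo mapping described above can be sketched as follows. This is an illustrative sketch only: the choice of SHA-256, the truncation to 8 bytes, the zero-based bucket index, and the function names `row_id_hash` and `bucket_index` are assumptions and are not specified by this application.

```python
import hashlib


def row_id_hash(row_id) -> int:
    """Hash a numeric or string-type row identifier to a non-negative integer.

    Assumption: the identifier is converted to text and hashed with SHA-256.
    """
    digest = hashlib.sha256(str(row_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_index(row_id, n_buckets: int) -> int:
    """Return the zero-based index of the hash bucket for a row identifier."""
    return row_id_hash(row_id) % n_buckets
```

Because the same function is applied to numeric and string-type identifiers, a row with identifier 5 and a row with identifier "张三" are both assigned a bucket in the same way. The source side and the target side would need to use the same hash function and the same N for the buckets to be comparable.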
The one-to-one correspondence between the N first leaf nodes and the N first hash buckets can be understood as follows: the i-th first leaf node (the first leaf node with sequence number i) corresponds to the i-th first hash bucket (the first hash bucket with sequence number i). The sequence numbers of the N first leaf nodes are all different, the sequence numbers of the N first hash buckets are all different, and i is a positive integer greater than or equal to 1 and less than or equal to N.
The N first hash buckets are obtained by hash partitioning the first data table by row identifier. The correspondence between the first hash buckets and the first data table is not specifically limited in this application.
In one example, each first hash bucket is determined from one row of the first data table. In this case, each first hash bucket corresponds to one row of the first data table, and every first hash bucket contains the same number of row identifiers from the first data table.
In another example, at least one of the N first hash buckets is determined from two or more rows of the first data table. In this case, at least one first hash bucket corresponds to two or more rows of the first data table, and the number of row identifiers contained in each first hash bucket may differ.
Optionally, at least one of the N first hash buckets may be empty. For example, N-1 first hash buckets are determined from all the row data in the first data table, and the remaining first hash bucket does not contain any data from the first data table.
The statement that any two first hash buckets are different can be understood as follows: any two first hash buckets have different sequence numbers and contain different row identifiers from the first data table.
For example, hash bucket #1 has sequence number 1 and contains two row identifiers from the first data table, "5" and "6"; hash bucket #2 has sequence number 2 and contains one row identifier from the first data table, "1". In this case, hash bucket #1 and hash bucket #2 are considered different.
In one example, the first data table may include M rows, where M is a positive integer greater than or equal to N. In this case, processing the first data table in the first database to generate the first Merkle tree may include the following steps:
Hash the M rows to obtain M first hash groups, where the M first hash groups are in one-to-one correspondence with the M rows, each first hash group includes the row identifier of one of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers in different first hash groups are different;
Map the M first hash groups to the N first hash buckets;
Determine the hash values of the N first leaf nodes from the N first hash buckets;
Generate the first Merkle tree from the hash values of the N first leaf nodes.
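The first two of the steps above (forming the hash groups and mapping them to buckets) can be sketched as follows. This is a minimal sketch under assumptions: SHA-256 stands in for the unspecified hash function, row data is treated as a string, and the names `build_hash_groups` and `map_to_buckets` are hypothetical.

```python
import hashlib


def digest_int(text: str) -> int:
    """Hash text with SHA-256 (an assumed choice) and return an integer."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


def build_hash_groups(rows):
    """Turn (row_id, row_data) pairs into hash groups (row_id, row_data_hash)."""
    return [(row_id, digest_int(row_data)) for row_id, row_data in rows]


def map_to_buckets(hash_groups, n_buckets):
    """Distribute M hash groups over N buckets by hashing the row identifier."""
    buckets = [[] for _ in range(n_buckets)]
    for row_id, data_hash in hash_groups:
        index = digest_int(str(row_id)) % n_buckets
        buckets[index].append((row_id, data_hash))
    return buckets
```

Each row lands in exactly one bucket, and some buckets may remain empty, which matches the cases discussed in the surrounding text. The leaf hashes and the tree are then derived from these buckets.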
In this application, the number of first hash groups contained in each of the N first hash buckets is not specifically limited.
For example, all of the N first hash buckets may contain the same number of first hash groups. Alternatively, the N first hash buckets may each contain a different number of first hash groups. Alternatively, some of the N first hash buckets may contain the same number of first hash groups while the others contain different numbers.
In this application, the hash value of each first leaf node may be obtained by performing an XOR operation on the hash values in the first hash groups contained in the corresponding first hash bucket. It should be understood that when the first hash bucket corresponding to a first leaf node does not contain any of the M first hash groups, the hash value of that first leaf node may be empty.
Determining the hash values of the N first leaf nodes from the N first hash buckets may include the following:
When the first hash bucket corresponding to at least one of the N first leaf nodes contains at least one of the M first hash groups, the hash value of that first leaf node is obtained by performing an XOR operation on the hash values in the first hash group or groups contained in the corresponding first hash bucket.
Optionally, the first hash bucket corresponding to the at least one first leaf node may contain two or more of the M first hash groups.
When the first hash bucket corresponding to at least one of the N first leaf nodes does not contain any of the M first hash groups, the hash value of that first leaf node is equal to zero.
The height of the first Merkle tree is associated with the first data table. Before the first Merkle tree is built, its parameters, such as the number of leaf nodes and the tree height, need to be determined according to the size of the first data table. The tree height of the first Merkle tree adapts to the amount of data in the first data table: the larger the amount of data, the higher the tree. In other words, the first Merkle tree for a first data table containing a larger amount of data (for example, 1 GB) is taller than the one for a first data table containing a smaller amount of data (for example, 100 MB).
The following uses the Merkle tree shown in FIG. 4 as an example to describe how the first Merkle tree is generated from the hash values of the first leaf nodes.
In the Merkle tree shown in FIG. 4 (an example of the first Merkle tree), the tree height is 3, the number of intermediate nodes is 2, and the number of leaf nodes is 4. The top layer is the root node, the next layer contains the intermediate nodes, the layer below that contains the leaf nodes, and the bottom layer contains the hash buckets described above (examples of the first hash buckets).
For ease of description, the four leaf nodes of the Merkle tree (examples of the first leaf nodes) are labeled, from left to right, leaf node 1, leaf node 2, leaf node 3, and leaf node 4. The four hash buckets are labeled, from left to right, hash bucket 1, hash bucket 2, hash bucket 3, and hash bucket 4. The four leaf nodes are in one-to-one correspondence with the four hash buckets: leaf node 1 corresponds to hash bucket 1, leaf node 2 to hash bucket 2, leaf node 3 to hash bucket 3, and leaf node 4 to hash bucket 4.
The hash value of each leaf node of the Merkle tree is obtained by performing an XOR operation on the hash values contained in the corresponding hash bucket. For example, the hash value of leaf node 1 is obtained by XORing the hash values in hash bucket 1; that is, it can be expressed as N0 = XOR(1, 5), where XOR(1, 5) denotes the result of XORing the hash value for row identifier 1 (0xeffe898) and the hash value for row identifier 5 (0xb8b8dd) contained in hash bucket 1.
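The leaf-node computation can be illustrated with a short sketch that uses the two hash values quoted for hash bucket 1. The function name `leaf_hash` and the representation of a bucket as a list of (row identifier, hash value) pairs are assumptions made for illustration.

```python
from functools import reduce
from operator import xor


def leaf_hash(bucket):
    """XOR the row-data hashes of all hash groups in a bucket.

    An empty bucket yields 0, matching the case in which the leaf hash
    is described as equal to zero.
    """
    return reduce(xor, (data_hash for _row_id, data_hash in bucket), 0)


# Hash bucket 1 holds the hash groups for row identifiers 1 and 5.
n0 = leaf_hash([(1, 0xEFFE898), (5, 0xB8B8DD)])
```

Because XOR is commutative and associative, the leaf hash does not depend on the order in which rows arrive in the bucket, which suits rows that are extracted by parallel threads.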
The hash value of each intermediate node of the Merkle tree is obtained by hashing the hash values of its two child nodes. For example, N4 = H(N0, N1) denotes the hash value of one intermediate node, where H(N0, N1) denotes the result of hashing the hash values of that node's two leaf nodes (N0 and N1).
The hash value of the root node of the Merkle tree is likewise obtained by hashing the hash values of its two child nodes. For example, H(N4, N5) denotes the hash value of the root node of the Merkle tree.
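Building the upper layers from the leaf hashes can be sketched as below. The sketch assumes that H is SHA-256 over the concatenated 32-byte encodings of the two child hashes, that the number of leaves is a power of two, and that N5 = H(N2, N3) by analogy with N4; none of these details is fixed by the text, and the names `combine` and `build_levels` are hypothetical.

```python
import hashlib


def combine(left: int, right: int) -> int:
    """H(left, right): hash two child hashes into a parent hash (assumed SHA-256)."""
    payload = left.to_bytes(32, "big") + right.to_bytes(32, "big")
    return int.from_bytes(hashlib.sha256(payload).digest(), "big")


def build_levels(leaf_hashes):
    """Return the tree as a list of levels, root level first, leaf level last."""
    n = len(leaf_hashes)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of leaves")
    levels = [list(leaf_hashes)]
    while len(levels[0]) > 1:
        below = levels[0]
        above = [combine(below[i], below[i + 1]) for i in range(0, len(below), 2)]
        levels.insert(0, above)
    return levels
```

For four leaves N0 to N3 this produces three levels, consistent with the tree height of 3 in FIG. 4: the leaves, the intermediate nodes N4 and N5, and the root H(N4, N5).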
The i-th leaf node #1 can be understood as the leaf node #1 with sequence number i, and the i-th hash bucket #1 can be understood as the hash bucket #1 with sequence number i, where i = 1, 2, 3, 4.
It should be understood that FIG. 4 is merely illustrative and does not limit this application. For example, in some implementations, the Merkle tree shown in FIG. 4 may include more leaf nodes or may have a greater tree height.
Before step 110, the method may further include the following operation: obtaining the first data table from the first database.
For ease of description, the method for obtaining the first data table from the first database is described below with reference to FIG. 5 and FIG. 6. It should be understood that FIG. 5 and FIG. 6 are merely illustrative and do not limit the method for obtaining the first data table in this application.
FIG. 5 is a schematic diagram of extracting data from the first data table provided in this application. It should be understood that FIG. 5 is merely an example. For example, data table #1 may include more rows (for example, 100) or fewer rows (for example, 4), and the data extraction module may include more threads.
As shown in FIG. 5, data may be extracted from the first data table by a data extraction module. Specifically, the data extraction module may be the source-side data extraction module 131 or the target-side data extraction module 134 shown in FIG. 2. That is, the source-side data extraction module 131 and the target-side data extraction module 134 in FIG. 2 have the data extraction function described below.
In one example, extracting data from the first data table may include, but is not limited to, the following steps:
Divide the first data table by row into S batches of data, where S is a positive integer greater than or equal to 1;
Process the S batches of data with S processing threads, where the S processing threads are in one-to-one correspondence with the S batches of data;
Enqueue the processed S batches of data into S queues, where the S batches of data are in one-to-one correspondence with the S queues.
The number of processing threads may be set according to the size of the first data table. For example, more processing threads may be used when the first data table is large, and fewer when it is small.
In the above technical solution, row data may be extracted from the first data table in batches. Each batch may be processed by a separate thread, the threads may run in parallel, and the extracted data is placed in the corresponding data queue.
Optionally, a single processing thread may also be used to process the data in the data table.
In one example, the data extraction module may be the source-side data extraction module 131 in FIG. 2 or the target-side data extraction module 134 in FIG. 2. That is, the source-side data extraction module 131 and the target-side data extraction module 134 in FIG. 2 have the functions of the data extraction module described above.
Referring to FIG. 5, data table #1 (an example of the first data table) contains 8 rows of data. Rows 1 to 4 may be taken as the first batch, and rows 5 to 8 as the second batch. The data extraction module may include two threads responsible for extracting data, thread #1 and thread #2: thread #1 extracts the first batch and thread #2 extracts the second batch. Thread #1 puts the extracted data into queue #1, and thread #2 puts the extracted data into queue #2. The data extraction threads, that is, thread #1 and thread #2, may run in parallel to improve extraction efficiency.
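The batch extraction in FIG. 5 can be sketched with standard-library threads and queues. This is a sketch only: an in-memory list stands in for the database table, and the function name `extract_in_batches` and the contiguous batching rule are assumptions.

```python
import queue
import threading


def extract_in_batches(rows, n_batches):
    """Split rows into contiguous batches; one thread and one queue per batch."""
    batch_size = -(-len(rows) // n_batches)  # ceiling division
    batches = [rows[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]
    queues = [queue.Queue() for _ in range(n_batches)]

    def worker(batch, out_queue):
        # A real extractor would read the batch from the database here.
        for row in batch:
            out_queue.put(row)

    threads = [threading.Thread(target=worker, args=(b, q))
               for b, q in zip(batches, queues)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queues
```

Calling `extract_in_batches(rows, 2)` on 8 rows mirrors the figure: thread #1 fills queue #1 with rows 1 to 4 and thread #2 fills queue #2 with rows 5 to 8. Passing `n_batches=1` corresponds to the single-thread option.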
FIG. 6 is a schematic diagram, provided in this application, of hash partitioning the data extracted from the first data table by row identifier.
As shown in FIG. 6, the hash partitioning by row identifier of the data extracted from data table #1 (an example of the first data table) may be performed by a data processing module. Specifically, the data processing module may be the source-side processing module 132 or the target-side processing module 133 shown in FIG. 2. That is, the source-side processing module 132 and the target-side processing module 133 in FIG. 2 have the hash partitioning function described below. Referring to FIG. 6, data table #1 includes 8 rows of data, whose row identifiers are 1, 2, 3, ..., 8.
In the embodiments of this application, the process of hash partitioning data table #1 by row identifier and mapping the result to N (N = 4) hash buckets #1 may be as follows:
First, the row data of each row in data table #1 is hashed to obtain the corresponding hash value, and each hash value together with its row identifier is enqueued in order into hash data queue #1. For ease of description, each entry in hash data queue #1 may be called a hash group (an example of the first hash group described above). For example, the 1st hash group in hash data queue #1 has row identifier 1 and stores the hash value 0xffe898, and the 5th hash group has row identifier 5 and stores the hash value 0xb8bdd. See FIG. 6 for details; the remaining hash groups are not listed here.
Optionally, in some implementations, both the row identifier and the row data of each row may be hashed, yielding a hash value corresponding to the row identifier and a hash value corresponding to the row data.
Then, a modulo operation is performed on each row identifier in hash data queue #1, and the hash bucket for the row containing that row identifier is determined from the result. Specifically, hash data queue #1 includes 8 hash groups. Row identifier 1 of the 1st hash group modulo 4 gives 1, and row identifier 5 of the 5th hash group modulo 4 also gives 1, so the 1st hash group and the 5th hash group can be mapped to the 1st hash bucket #1 (that is, the hash bucket with sequence number 1). In other words, the row data with row identifier 1 in data table #1 is mapped to the hash bucket #1 with sequence number 1. Processing the other hash groups in hash data queue #1 in the same way maps the 2nd and 6th hash groups to the 2nd hash bucket #1, the 3rd and 7th hash groups to the 3rd hash bucket #1, and the 4th and 8th hash groups to the 4th hash bucket #1 (for row identifiers 4 and 8 the remainder is 0, which in this example corresponds to the hash bucket with sequence number 4).
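The partitioning just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`row_hash`, `bucket_index`, `hash_partition`), the use of SHA-256 as the row hash, and the convention that a remainder of 0 selects bucket N are not taken from the application; they are chosen so that the result reproduces the mapping of FIG. 6.

```python
import hashlib

def row_hash(row_data: str) -> int:
    """Hash the row data of one row (SHA-256 is an assumed choice)."""
    return int.from_bytes(hashlib.sha256(row_data.encode("utf-8")).digest(), "big")

def bucket_index(row_id: int, n_buckets: int) -> int:
    """Return a 1-based bucket number from the row identifier.

    Assumption: a remainder of 0 selects bucket N, so that row
    identifiers 4 and 8 land in bucket 4 when N = 4.
    """
    remainder = row_id % n_buckets
    return remainder if remainder != 0 else n_buckets

def hash_partition(rows, n_buckets):
    """Map (row_id, row_data) pairs to buckets of (row_id, hash) hash groups."""
    buckets = {b: [] for b in range(1, n_buckets + 1)}
    for row_id, row_data in rows:
        buckets[bucket_index(row_id, n_buckets)].append((row_id, row_hash(row_data)))
    return buckets

# Data table #1 with 8 rows; the row contents are made up for illustration.
table_1 = [(i, f"row-{i}") for i in range(1, 9)]
buckets_1 = hash_partition(table_1, 4)
bucket_ids = {b: [rid for rid, _ in groups] for b, groups in buckets_1.items()}
```

Because the bucket is chosen from the row identifier alone, applying the same function to the target-end table places rows with the same identifier in buckets with the same sequence number.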
It should be understood that FIG. 6 is only an example. For example, there may be more (for example, 8) or fewer (for example, 2) hash buckets #1, and data table #1 may include more (for example, 100) or fewer (for example, 4) rows of data.
Step 120: Process the second data table in the second database to generate a second Merkle tree.
The second database can be understood as the target-end database. For example, the second database may be the target-end database 120 shown in FIG. 1.
In this application, the second data table is obtained by synchronizing or migrating the first data table to the second database. The second Merkle tree includes N second leaf nodes, which correspond one-to-one to N second hash buckets, and the hash value of each second leaf node is determined according to its corresponding second hash bucket. The N second hash buckets are obtained by hash partitioning the second data table by row identifier, using the same hash rule as is used to hash partition the first data table by row identifier into the N first hash buckets. Any two second hash buckets are different.
In this application, the second Merkle tree includes N second leaf nodes, and the first Merkle tree includes N first leaf nodes. Because the two Merkle trees have the same number of leaf nodes, the first Merkle tree and the second Merkle tree can be considered to have the same tree height. That is, the first Merkle tree and the second Merkle tree provided in this application have the same tree height.
The one-to-one correspondence between the N second leaf nodes and the N second hash buckets can be understood as follows: the i-th second leaf node (that is, the second leaf node with sequence number i) among the N second leaf nodes corresponds to the i-th second hash bucket (that is, the second hash bucket with sequence number i) among the N second hash buckets. The sequence numbers of the N second leaf nodes are all different, the sequence numbers of the N second hash buckets are all different, and i is a positive integer greater than or equal to 1 and less than or equal to N.
The hash value of each second leaf node is determined according to its corresponding second hash bucket. For the specific method, refer to the method in step 110 for determining the hash value of a first leaf node according to its corresponding first hash bucket.
The N second hash buckets are obtained by hash partitioning the second data table by row identifier. This application does not specifically limit the correspondence between the second hash buckets and the second data table.
In one example, each second hash bucket is determined from one row of the second data table.
In another example, at least one of the N second hash buckets is determined from two or more rows of the second data table.
In yet another example, at least one of the N second hash buckets may be empty.
As noted above, the i-th second leaf node among the N second leaf nodes corresponds to the i-th second hash bucket among the N second hash buckets, where i is a positive integer greater than or equal to 1 and less than or equal to N, and each of the N second leaf nodes has a different sequence number.
That the second data table is obtained by synchronizing or migrating the first data table to the second database may include the following cases:
If no data is lost while the first data table is synchronized or migrated to the second database, the second data table includes the same number of rows as the first data table.
For example, the first data table includes 10 rows, each including a row identifier and row data. If no data is lost while the first data table is synchronized or migrated to the second database, the resulting second data table also includes 10 rows.
If data is lost while the first data table is synchronized or migrated to the second database, the second data table includes fewer rows than the first data table.
For example, the first data table includes 10 rows, each including a row identifier and row data. If 1 row of data is lost while the first data table is synchronized or migrated to the second database, the resulting second data table includes 9 rows.
That is, the number of rows in the second data table in this application may be equal to, or smaller than, the number of rows in the first data table.
That the hash rule used to partition the second data table into the N second hash buckets is the same as the hash rule used to partition the first data table into the N first hash buckets can be understood as follows: because the second data table and the first data table are partitioned by row identifier under the same hash rule, data with the same row identifier in the second data table and in the first data table is guaranteed to be mapped to hash buckets with the same sequence number.
That any two second hash buckets are different can be understood as follows: any two second hash buckets have different sequence numbers, and any two non-empty second hash buckets include different row identifiers.
In this application, when the second data table includes K rows and the first data table includes M rows, where K is a positive integer less than or equal to M, processing the second data table in the second database to generate the second Merkle tree may include:
hashing the K rows to obtain K second hash groups, where the K second hash groups correspond one-to-one to the K rows, each second hash group includes the row identifier of one of the K rows and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the second hash groups are all different;
mapping the K second hash groups to the N second hash buckets;
determining the hash values of the N second leaf nodes according to the N second hash buckets; and
generating the second Merkle tree according to the hash values of the N second leaf nodes.
When K equals M, the second data table includes the same number of rows as the first data table; that is, no data was lost while the first data table was synchronized or migrated to the second database. When K is less than M, the second data table includes fewer rows than the first data table; that is, data was lost while the first data table was synchronized or migrated to the second database.
It should be understood that the rule for mapping the K second hash groups to the N second hash buckets is the same as the rule for mapping the M first hash groups to the N first hash buckets. That is, the hash rule used to partition the second data table by row identifier into the N second hash buckets is the same as the hash rule used to partition the first data table by row identifier into the N first hash buckets.
For example, the first hash group with sequence number 1 includes row identifier 1 and the hash value of the corresponding row data, and this first hash group is mapped to the first hash bucket with sequence number 5. When the second data table includes a row with row identifier 1, that row corresponds to the second hash group with sequence number 1, and this second hash group is mapped to the second hash bucket with sequence number 5.
It should also be understood that when the second hash bucket corresponding to a second leaf node is empty, the hash value of that second leaf node is also empty.
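The four operations above (hash the rows, map the hash groups to buckets, derive the leaf hashes, build the tree) can be sketched as follows. This is a minimal sketch, not the application's implementation. The following are all assumptions made for illustration: SHA-256 as the hash function, a bucket index of `(row_id - 1) % n`, N being a power of two, a parent hash computed over the concatenation of its two children, and an empty byte string standing in for the "empty" hash value of a leaf whose bucket is empty.

```python
import hashlib
from functools import reduce

def h(data: bytes) -> bytes:
    """SHA-256 digest (an assumed hash function)."""
    return hashlib.sha256(data).digest()

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length digests."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_buckets(rows, n):
    """Turn (row_id, row_data) pairs into hash groups and place them in n buckets."""
    buckets = [[] for _ in range(n)]
    for row_id, row_data in rows:
        buckets[(row_id - 1) % n].append((row_id, h(row_data.encode("utf-8"))))
    return buckets

def leaf_hash(bucket) -> bytes:
    """XOR of the row hashes in a bucket; an empty bucket gives an empty leaf value."""
    return reduce(xor_bytes, (hv for _, hv in bucket)) if bucket else b""

def build_merkle(buckets):
    """Return the tree as a list of levels: leaves first, root last."""
    levels = [[leaf_hash(b) for b in buckets]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

source_rows = [(i, f"data-{i}") for i in range(1, 9)]   # M = 8 rows
target_rows = [r for r in source_rows if r[0] != 7]      # K = 7: row 7 was lost
tree_1 = build_merkle(make_buckets(source_rows, 4))
tree_2 = build_merkle(make_buckets(target_rows, 4))
```

Both trees have four leaves and therefore the same height even though K is less than M; only the leaf whose bucket should have contained row 7, and the nodes above it, differ.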
Before step 120, the method may further include obtaining the second data table from the second database.
This application does not specifically limit the manner of obtaining the second data table.
For example, a check mark may be written into the first data table of the first database. While the second database replicates from the first database, when the check mark is detected in the second database, the data carrying the check mark is used as the data of the second data table.
Content of step 120 that is not described in detail is the same as the content of step 110; for details, refer to step 110, which is not repeated here.
Step 130: Compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
That the first data table is consistent with the second data table can be understood as follows: the two tables store the same number of row identifiers, and the row data corresponding to each identical row identifier has the same content.
In this application, comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table may include:
determining the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;
if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determining that the first data table is consistent with the second data table; and
if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, determining that the first data table is inconsistent with the second data table.
In the above technical solution, the hash value of each node of a Merkle tree is obtained by hashing that node's child nodes. For example, the hash value of the root node is determined from the two intermediate nodes that are its children, and the hash value of a leaf node is determined from the data in the hash bucket corresponding to that leaf node. Therefore, when the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, the first data table and the second data table can be considered consistent. When the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, the first data table and the second data table can be considered to differ, that is, to be inconsistent.
Optionally, after it is determined that the first data table is inconsistent with the second data table, the method may further include the following operations:
determining that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, where the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, and i is a positive integer with 1≤i≤N;
comparing the i-th first hash bucket with the i-th second hash bucket to determine the row identifiers that are inconsistent between the first data table and the second data table; and
querying the first database and the second database respectively, according to the inconsistent row identifiers, for the row data corresponding to those row identifiers.
The determination that the hash value of the i-th first leaf node differs from the hash value of the i-th second leaf node may be made, for example, as follows: first compare the hash value of the first leaf node with sequence number 1 with the hash value of the second leaf node with sequence number 1; if they are the same, go on to compare the hash values of the leaf nodes with sequence number 2, and so on, until the hash value of the first leaf node with sequence number i is found to differ from the hash value of the second leaf node with sequence number i. Here the first leaf node with sequence number i can be understood as the i-th first leaf node, and the second leaf node with sequence number i as the i-th second leaf node.
As another example, the comparison may start from the other end: first compare the hash value of the first leaf node with sequence number N with the hash value of the second leaf node with sequence number N; if they are the same, go on to compare the hash values of the leaf nodes with sequence number N-1, and so on, until the hash value of the first leaf node with sequence number i is found to differ from the hash value of the second leaf node with sequence number i.
Comparing the i-th first hash bucket with the i-th second hash bucket to determine the row identifiers that are inconsistent between the first data table and the second data table may include:
determining that a hash group in the i-th first hash bucket and a hash group in the i-th second hash bucket include the same row identifier but different hash values; and
determining that this row identifier is a row identifier that is inconsistent between the first data table and the second data table.
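The leaf-by-leaf scan and the bucket comparison described above can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, the hash values are tiny made-up integers, and reporting row identifiers that are present in only one bucket is an assumption that goes slightly beyond the text, which only discusses identifiers present on both sides. The final step, querying both databases for the row data, is not shown.

```python
from functools import reduce

def leaf_hash(bucket):
    """XOR of the hash values of all hash groups in a bucket (0 for an empty bucket)."""
    return reduce(lambda acc, group: acc ^ group[1], bucket, 0)

def first_mismatched_leaf(leaves_1, leaves_2):
    """Scan sequence numbers 1..N and return the first i whose leaf hashes differ."""
    for i, (a, b) in enumerate(zip(leaves_1, leaves_2), start=1):
        if a != b:
            return i
    return None

def inconsistent_row_ids(bucket_1, bucket_2):
    """Row identifiers whose hash values differ between the two buckets.

    Assumption: identifiers present in only one bucket are reported as well.
    """
    d1, d2 = dict(bucket_1), dict(bucket_2)
    return sorted(rid for rid in d1.keys() | d2.keys() if d1.get(rid) != d2.get(rid))

# Two buckets per table; each hash group is (row identifier, hash value).
buckets_1 = [[(1, 0b011), (5, 0b101)], [(2, 0b001), (6, 0b111)]]
buckets_2 = [[(1, 0b011), (5, 0b101)], [(2, 0b001), (6, 0b100)]]
leaves_1 = [leaf_hash(b) for b in buckets_1]
leaves_2 = [leaf_hash(b) for b in buckets_2]

i = first_mismatched_leaf(leaves_1, leaves_2)
bad_rows = inconsistent_row_ids(buckets_1[i - 1], buckets_2[i - 1])
```

Scanning from sequence number N downwards instead only changes the iteration order of `first_mismatched_leaf`.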
In the above technical solution, after the inconsistent row identifiers are located, the inconsistent row data needs to be located further. In this embodiment, the process of locating the inconsistent row data may be as follows: first, the row data identifier is extracted from the inconsistent row data; the corresponding row data is then looked up in the source-end database and in the target-end database respectively using the row data identifier; because the row data contains each data item of the row, the inconsistent data is found by comparing the values directly.
It can be understood that the data comparison module compares data consistency top-down. After the specific inconsistent data is found, the data storage module stores this information in a non-volatile storage medium so that it can be queried when needed.
As an example, the following describes, with reference to the two Merkle trees shown in FIG. 7, how the two Merkle trees are compared for consistency according to the method of step 130.
The two Merkle trees shown in FIG. 7 may be obtained according to step 110 and step 120. For ease of description, they are denoted Merkle tree #1 (that is, an example of the first Merkle tree) and Merkle tree #2 (that is, an example of the second Merkle tree).
The hash values of the nodes are described below using Merkle tree #1 as an example; the hash values of the nodes of Merkle tree #2 are obtained in a similar way. The 4 leaf nodes #1 of Merkle tree #1 (that is, an example of the first leaf nodes) correspond to 4 hash buckets #1 (that is, an example of the first hash buckets). From left to right, the 4 leaf nodes #1 are denoted the 1st, 2nd, 3rd, and 4th leaf node #1, and the 4 hash buckets #1 are denoted the 1st, 2nd, 3rd, and 4th hash bucket #1. Each hash bucket #1 includes 2 hash groups, and each hash group includes a row identifier and a hash value.
The hash value of the 1st leaf node #1 is obtained by XORing the hash values of the 2 hash groups in the 1st hash bucket #1, that is, N0 = XOR(1, 5), where XOR(1, 5) denotes the XOR of the hash value corresponding to row identifier 1 and the hash value corresponding to row identifier 5; that is, XOR(1, 5) equals XOR(011, 101). The hash value of an intermediate node #1 is obtained by hashing the hash values of its two child nodes. For example, N4 = H(N0, N1) is the hash value of one intermediate node #1 of Merkle tree #1, where H(N0, N1) denotes the result of hashing the hash values N0 and N1 of that node's two leaf nodes #1. The hash value of root node #1 is likewise obtained by hashing the hash values of its two child nodes; for example, H(N4, N5) is the hash value of root node #1 of Merkle tree #1.
Referring to FIG. 7, the consistency check of Merkle tree #1 and Merkle tree #2 may proceed as follows. First, the hash values of the root nodes are compared. Because the 4th hash bucket #1 of Merkle tree #1 differs from the 4th hash bucket #2 of Merkle tree #2 (that is, the hash values of the hash groups with row identifier "6" differ), the hash value of root node #1 of Merkle tree #1 (that is, H(N4, N5) in Merkle tree #1) differs from the hash value of root node #2 of Merkle tree #2 (that is, H(N4, N5) in Merkle tree #2). Next, the leaf nodes in which Merkle tree #1 and Merkle tree #2 are inconsistent are searched for top-down. From FIG. 7, it can be determined that the 4th leaf node #1 differs from the 4th leaf node #2. Then, from the hash buckets corresponding to the 4th leaf node #1 and the 4th leaf node #2, the inconsistent row identifier is determined to be "6".
Above, the i-th leaf node #1 can be understood as the leaf node #1 with sequence number i, and the i-th hash bucket #1 as the hash bucket #1 with sequence number i, where i = 1, 2, 3, 4.
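The top-down search of FIG. 7 can be sketched as a walk that starts at the two roots and only follows child nodes whose hash values differ. This is a hedged sketch: the level-list representation, the function names, SHA-256 as H, and the made-up leaf values are assumptions for illustration. Only the first leaf value, 011 XOR 101 = 110, follows the example in the text.

```python
import hashlib

def h(left: bytes, right: bytes) -> bytes:
    """H(left, right): SHA-256 over the concatenated child hashes (assumed)."""
    return hashlib.sha256(left + right).digest()

def build_levels(leaves):
    """Build the tree as a list of levels, leaves first, root last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        levels.append([h(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def differing_leaves(levels_1, levels_2):
    """Walk from the roots down, following only children whose hashes differ.

    Returns the 1-based sequence numbers of the leaves that differ.
    """
    top = len(levels_1) - 1
    if levels_1[top][0] == levels_2[top][0]:
        return []                      # equal roots: the tables are consistent
    frontier = [0]
    for depth in range(top - 1, -1, -1):
        frontier = [child
                    for i in frontier
                    for child in (2 * i, 2 * i + 1)
                    if levels_1[depth][child] != levels_2[depth][child]]
    return [i + 1 for i in frontier]

n0 = bytes([0b011 ^ 0b101])            # N0 = XOR(011, 101) = 110
leaves_1 = [n0, b"\x01", b"\x02", b"\x03"]
leaves_2 = [n0, b"\x01", b"\x02", b"\x04"]   # the 4th leaf differs
tree_1 = build_levels(leaves_1)
tree_2 = build_levels(leaves_2)
mismatches = differing_leaves(tree_1, tree_2)
```

With N leaves and a single differing leaf, this walk compares about 2·log2(N) node pairs instead of all N leaves.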
It should be understood that FIG. 7 is only illustrative and does not limit this application in any way. For example, each hash bucket among hash buckets #1 and hash buckets #2 shown in FIG. 7 may include a different number of hash groups. For example, the hash values included in hash buckets #1 and hash buckets #2 shown in FIG. 7 may also be hash values of a larger magnitude.
In this application, the types of the first database and the second database involved in steps 110 to 130 are not specifically limited.
Optionally, the first database and the second database may be heterogeneous databases.
Optionally, the first database and the second database may be homogeneous databases.
Optionally, the first database may be a relational database or a non-relational database, and the second database may be a relational database or a non-relational database.
For example, the first database may be a relational database and the second database a non-relational database. As another example, the first database may be a relational database and the second database also a relational database.
A relational database usually stores data in tables, so data can be extracted from it directly to construct the first data table. A non-relational database usually stores data in non-tabular form (for example, as documents, key-value pairs, or graph structures), so before data is extracted from it to construct the first data table, the data to be verified must first be converted into a table-based form in which each row may include a row identifier and one or more items of row data.
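As a rough illustration of the conversion just mentioned, the sketch below flattens documents from a document store into (row identifier, row data) pairs. Everything here is an assumption made for illustration: the application does not say how the conversion is done, the `_id` field name and the function name are hypothetical, and serializing the remaining fields as JSON with sorted keys is just one way to make the row data independent of field order.

```python
import json

def documents_to_rows(documents, id_field="_id"):
    """Convert documents to (row_id, row_data) pairs.

    The row data is a canonical JSON string (keys sorted), so two documents
    with the same fields in a different order produce the same row data.
    """
    rows = []
    for doc in documents:
        row_id = doc[id_field]
        fields = {k: v for k, v in doc.items() if k != id_field}
        rows.append((row_id, json.dumps(fields, sort_keys=True, ensure_ascii=False)))
    return rows

docs = [
    {"_id": 1, "name": "alice", "age": 30},
    {"age": 25, "_id": 2, "name": "bob"},      # same fields, different order
]
rows = documents_to_rows(docs)
```

Rows produced this way can then be hashed and partitioned in the same way as rows extracted from a relational table.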
In most scenarios, the data in a production database (that is, an example of the first database) changes dynamically, so data that has already been verified may change again during verification and need to be verified again. If the original Merkle tree is updated in place, every leaf update triggers recomputation of the hash values in all layers above it, and this computational overhead grows sharply when the Merkle tree has many layers. Another factor is that when incremental data grows quickly, the original Merkle tree has to be expanded and rebuilt to preserve overall verification performance, which affects data that has already been verified.
The data verification method provided in this application can also verify full data and incremental data separately. Specifically, the full data is verified with a Merkle tree that has more layers, and the incremental data is verified with a Merkle tree that has fewer layers, which effectively reduces the computational overhead.
This application does not specifically limit the method of obtaining the incremental data to be verified from a data table. For example, an existing method of obtaining the incremental data to be verified from a data table may be used, or another method may be used.
As an example, data table #1 (that is, an example of the first data table) includes 10 rows of data, and data table #2 (that is, an example of the second data table) includes 10 rows of data and is obtained by replicating data table #1. At time #1, the method of steps 110 to 130 can be used to generate Merkle tree #1 (that is, an example of the first Merkle tree) from data table #1 and Merkle tree #2 (that is, an example of the second Merkle tree) from data table #2, and whether data table #1 and data table #2 are consistent is determined by comparing Merkle tree #1 with Merkle tree #2. At a time after time #1, the data in rows 5 to 10 of data table #1 is updated. The updated data table #1 can be denoted data table #3 (that is, an example of the first data table), and data table #4 (that is, an example of the second data table) is obtained by replicating data table #3. In this case, the consistency check of steps 110 to 130 can be applied to the incremental data only. Specifically, Merkle tree #3 (that is, an example of the first Merkle tree) is generated from data table #3, Merkle tree #4 (that is, an example of the second Merkle tree) is generated from data table #4, and whether data table #3 and data table #4 are consistent is determined by comparing Merkle tree #3 with Merkle tree #4.
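The incremental check in this example can be sketched by building a small tree over the changed rows only. This is a sketch under stated assumptions rather than the application's method: how the changed row identifiers are obtained is out of scope, the function name `root_hash` is hypothetical, SHA-256 is an assumed hash, the bucket count is assumed to be a power of two, and an all-zero value stands in for the hash of an empty bucket.

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def root_hash(rows, n_buckets):
    """Root hash of a Merkle tree over rows (dict: row_id -> row data).

    Each leaf is the XOR of the row hashes in its bucket; an all-zero value
    stands in for an empty bucket (an assumption of this sketch).
    """
    leaves = [bytes(32) for _ in range(n_buckets)]
    for row_id, data in rows.items():
        i = (row_id - 1) % n_buckets
        leaves[i] = bytes(x ^ y for x, y in zip(leaves[i], sha(data.encode("utf-8"))))
    level = leaves
    while len(level) > 1:
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Data table #3: rows 5 to 10 were updated after time #1; data table #4 is its replica.
table_3 = {i: (f"v2-{i}" if i >= 5 else f"v1-{i}") for i in range(1, 11)}
table_4 = dict(table_3)
changed_ids = range(5, 11)

delta_3 = {i: table_3[i] for i in changed_ids}
delta_4 = {i: table_4[i] for i in changed_ids}

# The full check would use a taller tree; the increment uses only 2 buckets here.
incremental_consistent = root_hash(delta_3, 2) == root_hash(delta_4, 2)
```

The six changed rows are verified with a two-level tree, and the result obtained for rows 1 to 4 at time #1 is left untouched.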
The data verification method provided by the present application can better meet the need for accurate and fast verification of massive data in different scenarios (for example, online data or offline data). In the above technical solution, the hash value of each first leaf node of the first Merkle tree corresponding to the first data table is determined according to its corresponding first hash bucket, and the hash value of each second leaf node of the second Merkle tree corresponding to the second data table is determined according to its corresponding second hash bucket. The hash rule used to hash-partition the second data table by row identifier into N second hash buckets is the same as the hash rule used to hash-partition the first data table by row identifier into N first hash buckets. This ensures that data with the same row identifier in the second data table and the first data table is mapped to the second hash bucket and the first hash bucket with the same sequence number, so that the generated first Merkle tree and second Merkle tree have the same structure. Whether the first data table and the second data table are consistent can therefore be determined directly by comparing the hash values of the root nodes of the two Merkle trees. When the first Merkle tree and the second Merkle tree are generated, the hash values of their leaf nodes are obtained by performing an XOR operation on the data in the corresponding hash buckets, so sorting of row data can be avoided when the consistency check is performed on the first data table and the second data table.
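The reason XOR removes the need for sorting can be illustrated with a minimal Python sketch. It is not taken from the application: the SHA-256 digest and the names `row_digest` and `leaf_hash` are assumptions chosen for illustration. Because XOR is commutative and associative, the leaf value is the same whatever order the rows of a bucket arrive in.

```python
import hashlib
import random

def row_digest(row_data):
    """Hash one row's data to a 256-bit integer (SHA-256 is an assumed choice)."""
    return int.from_bytes(hashlib.sha256(row_data.encode("utf-8")).digest(), "big")

def leaf_hash(bucket):
    """XOR the row-data digests of every (row_id, row_data) pair in a bucket."""
    acc = 0
    for _row_id, row_data in bucket:
        acc ^= row_digest(row_data)
    return acc

bucket = [(1, "alice"), (5, "bob"), (9, "carol")]
shuffled = bucket[:]
random.shuffle(shuffled)

# Same rows in any order give the same leaf hash; a changed row gives a different one.
same = leaf_hash(bucket) == leaf_hash(shuffled)
changed = leaf_hash(bucket) != leaf_hash([(1, "alice"), (5, "bob"), (9, "carol2")])
```

One property worth keeping in mind with this sketch is that XOR cancels identical digests, so two rows with identical data in the same bucket contribute nothing to the leaf value.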
When the hash values of the root nodes of the two Merkle trees are determined to be different, the data verification method provided by the present application can further determine the inconsistent row identifiers and the corresponding row data by comparing the hash values of the leaf nodes of the two Merkle trees. To meet the needs of data verification in different scenarios, the method can also adaptively adjust the heights of the first Merkle tree and the second Merkle tree according to the size of the data set to be verified.
Next, the verification flow of the data verification method 200 provided by the present application is described in detail with reference to FIG. 8.
As shown in FIG. 8, the method 200 includes steps 210 to 290, which are described below. Steps 210 to 290 may be performed by the data verification apparatus 130 shown in FIG. 2.
Step 210: start.
Step 210 indicates that the data consistency check begins.
Step 220: add a check flag bit to the data to be replicated in database #1 (an example of the first database in method 100 above).
The method of adding the check flag to the data to be replicated may be the same as existing methods, and details are not repeated here.
Step 230: database #2 (an example of the second database in method 100 above) replicates the data to be replicated.
Step 240: database #1 detects the flag bit, and data table #1 is acquired.
The method of acquiring data table #1 after the flag bit is detected in database #1 may be the same as existing methods, and details are not repeated here.
Step 250: database #2 detects the flag bit, and data table #2 is acquired.
Step 260: generate Merkle tree #1 (an example of the first Merkle tree in method 100 above) from data table #1.
Step 270: generate Merkle tree #2 (an example of the second Merkle tree in method 100 above) from data table #2.
The method of determining the Merkle trees in steps 260 and 270 is the same as in method 100. For details, see step 110 above.
Step 280: determine whether the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2.
Determining whether the two root hash values are the same includes:
if the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2, performing step 281;
if the hash value of the root node of Merkle tree #1 is different from the hash value of the root node of Merkle tree #2, performing step 282 and step 283.
Step 281: determine that data table #1 and data table #2 are consistent.
Step 282: compare the hash values of the leaf nodes of Merkle tree #1 with the hash values of the leaf nodes of Merkle tree #2 to determine the inconsistent row identifiers.
For details, see the method in step 130 above.
In this application, the execution order of step 281 and step 282 is not specifically limited.
For example, after step 280, step 281 may be performed before step 282, or step 282 may be performed before step 281.
Step 283: according to the inconsistent row identifiers determined above, determine the inconsistent row data from database #1 and database #2.
For details, see the method in step 130 above.
Optionally, after step 283, the determined inconsistent row identifiers and the corresponding row data may be stored in a storage module of the data verification apparatus, for example, the storage module 136 shown in FIG. 2.
Step 290: end.
Step 290 indicates that the data consistency check ends.
It should be understood that FIG. 8 is only illustrative and does not limit the data verification process provided by the present application. For example, in some implementations, step 282 and step 283 may be skipped after data table #1 and data table #2 are determined to be inconsistent.
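The decision flow of steps 280 to 283 can be sketched as follows. This is a hedged illustration rather than the application's implementation: the `Tree` container, the `run_check` function, and the use of plain dictionaries to stand in for database #1 and database #2 are all hypothetical, and the trees are assumed to have been built already in steps 260 and 270.

```python
from dataclasses import dataclass

@dataclass
class Tree:
    root: bytes    # hash value of the root node
    leaves: list   # leaf hash per bucket; index i corresponds to bucket i
    buckets: list  # per bucket: dict mapping row_id -> hash of row data

def run_check(t1, t2, db1, db2):
    # Step 280: compare the two root hashes.
    if t1.root == t2.root:
        return {"consistent": True, "rows": {}}  # step 281
    # Step 282: compare leaves pairwise, then diff only the buckets whose leaves differ.
    bad_ids = set()
    for i, (leaf1, leaf2) in enumerate(zip(t1.leaves, t2.leaves)):
        if leaf1 != leaf2:
            b1, b2 = t1.buckets[i], t2.buckets[i]
            bad_ids |= {rid for rid in b1.keys() | b2.keys() if b1.get(rid) != b2.get(rid)}
    # Step 283: look up the row data for each inconsistent row identifier in both databases.
    rows = {rid: (db1.get(rid), db2.get(rid)) for rid in sorted(bad_ids)}
    return {"consistent": False, "rows": rows}

# Hand-made placeholder hashes, purely to exercise the control flow.
tree1 = Tree(b"r1", [b"a", b"b"], [{1: b"x"}, {2: b"y", 3: b"z"}])
tree2 = Tree(b"r2", [b"a", b"c"], [{1: b"x"}, {2: b"y", 3: b"q"}])
db1 = {1: "one", 2: "two", 3: "three"}
db2 = {1: "one", 2: "two", 3: "tree"}
report = run_check(tree1, tree2, db1, db2)
```

Only the bucket whose leaf hashes differ is inspected row by row, which is what keeps the drill-down cheap when most of the data matches.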
FIG. 9 is a schematic diagram of Merkle trees determined according to the method provided in the present application.
FIG. 9 shows four Merkle trees: Merkle tree #3 (an example of the first Merkle tree above), Merkle tree #4 (another example of the first Merkle tree above), Merkle tree #5 (an example of the second Merkle tree above), and Merkle tree #6 (another example of the second Merkle tree above). Merkle tree #3 and Merkle tree #5 have a height of 3, and Merkle tree #4 and Merkle tree #6 have a height of 2. For a detailed description of Merkle trees #3, #4, #5, and #6, see the description of FIG. 7 above.
Merkle tree #3 can be understood as a Merkle tree generated at time #1 from the full data in data table #1 (an example of the first data table above). Merkle tree #4 can be understood as a Merkle tree generated at time #2 from the incremental data in data table #1, where time #2 is a time after time #1 and the incremental data in data table #1 at time #2 includes the row data corresponding to the following row identifiers in data table #1: "2", "6", "7", and "8". Merkle tree #5 can be understood as a Merkle tree generated at time #1 from the full data in data table #2 (an example of the second data table above). Merkle tree #6 can be understood as a Merkle tree generated at time #2 from the incremental data in data table #2, where the incremental data in data table #2 at time #2 includes the row data corresponding to the following row identifiers in data table #2: "2", "6", "7", and "8". Data table #2 is the data table obtained by migrating or synchronizing data table #1.
For the method of generating Merkle trees #3, #4, #5, and #6, see method 100.
When a consistency check needs to be performed on data table #1 and data table #2:
whether the full data in data table #1 and data table #2 is consistent can be determined by comparing Merkle tree #3 with Merkle tree #5;
whether the incremental data in data table #1 and data table #2 is consistent can be determined by comparing Merkle tree #4 with Merkle tree #6.
Optionally, after the full data in data table #1 and data table #2 is determined to be inconsistent, the inconsistent row identifiers can be determined by comparing the leaf nodes of Merkle tree #3 with the leaf nodes of Merkle tree #5, and the row data corresponding to the inconsistent row identifiers can then be determined from data table #1 and data table #2.
Optionally, after the incremental data in data table #1 and data table #2 is determined to be inconsistent, the inconsistent row identifiers can be determined by comparing the leaf nodes of Merkle tree #4 with the leaf nodes of Merkle tree #6, and the row data corresponding to the inconsistent row identifiers can then be determined from data table #1 and data table #2.
In this embodiment of the present application, a separate, smaller Merkle tree is used to verify the incremental data, which further reduces computational overhead and improves the efficiency of the data consistency check.
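The incremental check can be sketched in Python as below. Everything here is an illustrative assumption rather than the application's own code: the helper names `digest` and `merkle_root`, the SHA-256 digest, the modulo bucket rule, and the restriction that the bucket count `n` is a power of two. The sketch builds a two-leaf tree over only the rows with identifiers 2, 6, 7, and 8, mirroring the example above, instead of rebuilding the tree over the whole table.

```python
import hashlib

def digest(value):
    """SHA-256 of a value's string form (an assumed hash choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).digest()

def merkle_root(rows, n):
    """Root hash over `rows` (dict row_id -> data) using n buckets; n is assumed a power of two."""
    leaves = [bytes(32)] * n
    for row_id, data in rows.items():
        i = int.from_bytes(digest(row_id), "big") % n          # same rule on both sides
        leaves[i] = bytes(x ^ y for x, y in zip(leaves[i], digest(data)))
    level = leaves
    while len(level) > 1:                                       # pair nodes bottom-up
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

source = {i: f"v{i}" for i in range(1, 9)}
replica = dict(source)
changed = {2, 6, 7, 8}                  # row identifiers updated after time #1
for rid in changed:
    source[rid] += "-new"
    replica[rid] += "-new"

# Only the incremental rows are hashed, with a smaller tree (2 leaves instead of 4).
delta_src = {rid: source[rid] for rid in changed}
delta_rep = {rid: replica[rid] for rid in changed}
delta_ok = merkle_root(delta_src, 2) == merkle_root(delta_rep, 2)
```

The saving comes from hashing four rows rather than the full table; how the changed row identifiers are tracked is outside the scope of this sketch.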
The data verification method provided by the present application and the system architecture applicable to it are described in detail above with reference to FIG. 1 to FIG. 9. The data verification apparatus, data verification device, and data verification system provided by the present application are described in detail below with reference to FIG. 10 to FIG. 12. It should be understood that the descriptions of the method embodiments and the apparatus embodiments correspond to each other, so for parts not described in detail, reference may be made to the foregoing method embodiments.
In this embodiment of the present application, the data verification apparatus includes a processing unit and a determining unit. The data verification apparatus may be the data verification apparatus 130 described above.
Optionally, in some implementations, the data verification apparatus may further include a transceiver unit.
The following description, with reference to FIG. 10, takes as an example a data verification apparatus that includes a processing unit and a determining unit.
FIG. 10 is a schematic structural diagram of a data verification apparatus 1000 provided by the present application.
As shown in FIG. 10, the apparatus 1000 includes a processing unit 1001 and a determining unit 1002.
The processing unit 1001 is configured to process a first data table in a first database to generate a first Merkle tree, where each row of the first data table includes a row identifier and row data, the first Merkle tree includes N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash-partitioning the first data table by row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2.
The processing unit 1001 is further configured to process a second data table in a second database to generate a second Merkle tree, where the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash-partitioning the second data table by row identifier, the hash rule used to hash-partition the second data table into the N second hash buckets is the same as the hash rule used to hash-partition the first data table into the N first hash buckets, and any two second hash buckets are different.
The determining unit 1002 is configured to compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
Optionally, the first data table includes M rows, where M is a positive integer greater than or equal to 1.
The processing unit 1001 is further configured to:
hash the M rows to obtain M first hash groups, where the M first hash groups are in one-to-one correspondence with the M rows, each first hash group includes one row identifier of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers included in the first hash groups are all different;
map the M first hash groups to the N first hash buckets.
The determining unit 1002 is further configured to:
determine the hash values of the N first leaf nodes according to the N first hash buckets.
The processing unit 1001 is further configured to:
generate the first Merkle tree according to the hash values of the N first leaf nodes.
Optionally, the hash value of each first leaf node is obtained by performing an XOR operation on the hash values included in the first hash groups contained in the corresponding first hash bucket.
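The four operations above (hash the rows into hash groups, map the groups to N buckets, derive the leaf hashes by XOR, and build the tree) can be sketched as follows. This is a minimal sketch under stated assumptions, not the application's implementation: SHA-256, the modulo bucket rule in `bucket_index`, the duplication of the last node when a level has an odd number of nodes, and the names `h` and `build_merkle` are all chosen for illustration.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def bucket_index(row_id, n):
    """Hash rule shared by both tables: the same row_id always lands in the same bucket."""
    return int.from_bytes(h(str(row_id).encode("utf-8")), "big") % n

def build_merkle(table, n):
    """table: dict row_id -> row_data. Returns (levels, buckets) with levels[0] the
    leaves and levels[-1] a one-element list holding the root."""
    buckets = [dict() for _ in range(n)]
    for row_id, row_data in table.items():
        # A hash group is (row identifier, hash of the row data).
        buckets[bucket_index(row_id, n)][row_id] = h(str(row_data).encode("utf-8"))
    leaves = []
    for bucket in buckets:
        acc = bytes(32)
        for row_hash in bucket.values():
            acc = bytes(x ^ y for x, y in zip(acc, row_hash))  # XOR of the group hashes
        leaves.append(acc)
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        # Assumed convention: an unpaired last node is paired with itself.
        nxt = [h(cur[i] + (cur[i + 1] if i + 1 < len(cur) else cur[i]))
               for i in range(0, len(cur), 2)]
        levels.append(nxt)
    return levels, buckets

table1 = {i: f"row{i}" for i in range(1, 11)}
table2 = dict(reversed(list(table1.items())))  # same rows, different arrival order
levels1, _ = build_merkle(table1, 4)
levels2, _ = build_merkle(table2, 4)
roots_match = levels1[-1] == levels2[-1]
```

With N = 4 the sketch produces levels of 4, 2, and 1 nodes, and the replica's different row order does not affect the root because neither the bucket mapping nor the XOR depends on order.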
Optionally, the determining unit 1002 is further configured to:
determine the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;
if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determine that the first data table is consistent with the second data table;
if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, determine that the first data table is inconsistent with the second data table.
Optionally, the determining unit 1002 is further configured to:
determine that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, where the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1≤i≤N;
compare the i-th first hash bucket with the i-th second hash bucket to determine the row identifiers that are inconsistent between the first data table and the second data table.
The processing unit 1001 is further configured to:
query, according to the inconsistent row identifiers, the row data corresponding to the inconsistent row identifiers from the first database and the second database respectively.
Optionally, the first data table includes the full data in at least one data table in the first database.
Optionally, the first data table includes incremental data in at least one data table in the first database.
Optionally, the height of the first Merkle tree is associated with the first data table.
Optionally, the first database and the second database are heterogeneous databases or homogeneous databases.
Optionally, the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.
The following description, with reference to FIG. 11, takes as an example a data verification device that includes a transceiver, a processor, and a memory.
FIG. 11 is a schematic structural diagram of a data verification device 1000 provided by the present application. As shown in FIG. 11, the device 1000 includes a transceiver 1010, a processor 1020, and a memory 1030. The transceiver 1010, the processor 1020, and the memory 1030 communicate with one another through an internal connection path to transfer control and/or data signals. The memory 1030 is configured to store a computer program, and the processor 1020 is configured to call and run the computer program from the memory 1030 to control the transceiver 1010 to send and receive signals.
Specifically, the transceiver 1010 may be configured to acquire the first data table and the second data table described above. Details are not repeated here.
Specifically, the functions of the processor 1020 correspond to the functions of the processing unit 1001 and the determining unit 1002 shown in FIG. 10. Details are not repeated here.
In this embodiment of the present application, the data verification device includes a processor. The data verification device may be any one of the terminal devices described above.
Optionally, in some implementations, the data verification device may further include a transceiver.
Optionally, in some implementations, the data verification device may further include a memory.
FIG. 12 is a schematic structural diagram of a system 1200 provided by the present application. As shown in FIG. 12, the system 1200 includes the data verification apparatus 1000 or the data verification device 1100 described above. Optionally, the system 1200 may further include the first database and the second database described above.
An embodiment of the present application provides a computer program product. When the computer program product runs on the data verification apparatus 1310, the data verification apparatus 1310 can perform method 100 and/or method 200 in the foregoing method embodiments.
A person of ordinary skill in the art will appreciate that the method steps and units described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and components of the embodiments have been described above generally in terms of their functions. Whether these functions are performed in hardware or software depends on the particular application and the design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
A person skilled in the art will clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments. Details are not repeated here.
In the several embodiments provided in this application, the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods in the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above descriptions are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), a semiconductor medium (for example, a solid-state drive), or the like.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
A person skilled in the art will clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments. Details are not repeated here.
In addition, the term "and/or" in this application only describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" in this document generally indicates an "or" relationship between the objects before and after it. The term "at least one" in this application can mean "one" or "two or more". For example, "at least one of A, B, and C" can mean any of seven cases: A alone, B alone, C alone, both A and B, both A and C, both C and B, or all of A, B, and C.
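The case counts in the definitions above (three for "A and/or B", seven for "at least one of A, B, and C") correspond to the non-empty subsets of the listed items. The short Python enumeration below, with an illustrative helper name `nonempty_subsets`, reproduces both counts.

```python
from itertools import combinations

def nonempty_subsets(items):
    """Every way of choosing at least one item from the list."""
    return [set(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]

and_or_cases = nonempty_subsets(["A", "B"])             # {A}, {B}, {A, B}
at_least_one_cases = nonempty_subsets(["A", "B", "C"])  # 2**3 - 1 subsets
```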
The foregoing is only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

  1. 一种数据校验方法,其特征在于,所述方法包括:A data verification method, characterized in that the method comprises:
    processing a first data table in a first database to generate a first Merkle tree, wherein each row of the first data table comprises a row identifier and row data, the first Merkle tree comprises N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to row identifiers, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
    processing a second data table in a second database to generate a second Merkle tree, wherein the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree comprises N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to row identifiers, the hash rule used to partition the second data table into the N second hash buckets is the same as the hash rule used to partition the first data table into the N first hash buckets, and any two second hash buckets are different;
    comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
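The hash partitioning recited above can be illustrated with a minimal Python sketch. The claim does not name a hash function or a partitioning rule, so SHA-256 and the modulo rule are assumptions, and `bucket_index` and `partition` are hypothetical helper names. The point shown is only that applying the same rule to both tables places each row identifier in the same bucket on both sides, regardless of scan order.

```python
import hashlib


def bucket_index(row_id: str, n_buckets: int) -> int:
    """Map a row identifier to one of n_buckets buckets (assumed rule: SHA-256 mod N)."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def partition(row_ids, n_buckets):
    """Partition row identifiers into n_buckets disjoint buckets."""
    buckets = [set() for _ in range(n_buckets)]
    for row_id in row_ids:
        buckets[bucket_index(row_id, n_buckets)].add(row_id)
    return buckets


# The same rows scanned in a different order on the target side.
source_ids = ["1", "2", "3", "4", "5"]
target_ids = ["5", "4", "3", "2", "1"]

source_buckets = partition(source_ids, 4)
target_buckets = partition(target_ids, 4)
print(source_buckets == target_buckets)
```

Because the bucket of a row depends only on its identifier, the i-th bucket of one table can be compared directly with the i-th bucket of the other.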
  2. The method according to claim 1, characterized in that the first data table comprises M rows, M being a positive integer greater than or equal to 1, and the processing of the first data table in the first database to generate the first Merkle tree comprises:
    hashing the M rows to obtain M first hash groups, wherein the M first hash groups are in one-to-one correspondence with the M rows, each first hash group comprises one row identifier of the M rows and a hash value of the row data corresponding to that row identifier, and the row identifiers comprised in the first hash groups are different from one another;
    mapping the M first hash groups to the N first hash buckets;
    determining the hash values of the N first leaf nodes according to the N first hash buckets;
    generating the first Merkle tree according to the hash values of the N first leaf nodes.
  3. The method according to claim 2, characterized in that
    the hash value of each first leaf node is obtained by performing an XOR operation on the hash values comprised in the first hash groups comprised in the corresponding first hash bucket.
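The hash groups and XOR-based leaf values of claims 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: SHA-256, the `repr`-based row serialization, the modulo bucket rule, and the all-zero value for an empty bucket are choices made here, and all function names are hypothetical.

```python
import hashlib


def row_hash(row_data: tuple) -> bytes:
    """Hash one row's data (assumed serialization: repr of a tuple)."""
    return hashlib.sha256(repr(row_data).encode("utf-8")).digest()


def bucket_index(row_id: str, n_buckets: int) -> int:
    """Assumed partitioning rule: SHA-256 of the row identifier, mod N."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def leaf_hashes(table: dict, n_buckets: int):
    """table maps row_id -> row_data; returns (buckets, leaf hash values)."""
    # Each hash group is (row_id, hash of row data), stored in its bucket.
    buckets = [dict() for _ in range(n_buckets)]
    for row_id, row_data in table.items():
        buckets[bucket_index(row_id, n_buckets)][row_id] = row_hash(row_data)
    leaves = []
    for bucket in buckets:
        leaf = bytes(32)  # empty bucket: all-zero leaf (assumption)
        for h in bucket.values():
            leaf = xor_bytes(leaf, h)
        leaves.append(leaf)
    return buckets, leaves


source = {"1": ("alice", 30), "2": ("bob", 25), "3": ("carol", 41)}
reordered = {"3": ("carol", 41), "1": ("alice", 30), "2": ("bob", 25)}
modified = {**source, "2": ("bob", 26)}

_, leaves_src = leaf_hashes(source, 4)
_, leaves_reordered = leaf_hashes(reordered, 4)
_, leaves_modified = leaf_hashes(modified, 4)
changed = [i for i in range(4) if leaves_src[i] != leaves_modified[i]]
print(leaves_src == leaves_reordered, changed)
```

XOR is commutative and associative, so a leaf value does not depend on the order in which rows are read, which is convenient when the two databases return rows in different orders. XOR also cancels identical values, so the sketch relies on the row identifiers within a bucket being distinct, as the claim requires.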
  4. The method according to any one of claims 1-3, characterized in that the comparing of the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table comprises:
    determining the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;
    if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determining that the first data table is consistent with the second data table;
    if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, determining that the first data table is inconsistent with the second data table.
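The root comparison of claim 4 can be sketched as below. The claims do not specify how parent nodes are derived from children, so this sketch assumes a binary tree whose parent value is the SHA-256 hash of the concatenated child values, with an unpaired node carried up unchanged; `merkle_root` and `tables_consistent` are hypothetical names.

```python
import hashlib


def merkle_root(leaves):
    """Fold a non-empty list of leaf hash values into a single root hash."""
    level = list(leaves)
    while len(level) > 1:
        # Pair adjacent nodes; parent = SHA-256(left || right) (assumed rule).
        nxt = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # carry an unpaired node up unchanged
        level = nxt
    return level[0]


def tables_consistent(leaves_a, leaves_b) -> bool:
    """Equal root hashes are taken to mean the two tables are consistent."""
    return merkle_root(leaves_a) == merkle_root(leaves_b)


leaves_a = [hashlib.sha256(bytes([i])).digest() for i in range(4)]
leaves_b = list(leaves_a)
leaves_c = list(leaves_a)
leaves_c[2] = hashlib.sha256(b"changed").digest()

print(tables_consistent(leaves_a, leaves_b), tables_consistent(leaves_a, leaves_c))
```

In the common case where the tables match, a single comparison of two root hashes is enough, and no bucket or row needs to be exchanged.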
  5. The method according to claim 4, characterized in that, after determining that the first data table is inconsistent with the second data table, the method further comprises:
    determining that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, wherein the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1≤i≤N;
    comparing the i-th first hash bucket with the i-th second hash bucket to determine the row identifiers for which the first data table and the second data table are inconsistent;
    querying, according to the inconsistent row identifiers, the row data corresponding to the inconsistent row identifiers from the first database and the second database respectively.
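The drill-down of claim 5 can be sketched as follows, assuming each bucket is held as a mapping from row identifier to row hash. The function names are hypothetical, the leaf scan is linear for brevity (an implementation could instead descend the two trees from the root), and indices here are 0-based whereas the claim counts from 1.

```python
def differing_leaves(leaves_a, leaves_b):
    """Return the 0-based indices i whose leaf hash values differ."""
    return [i for i, (a, b) in enumerate(zip(leaves_a, leaves_b)) if a != b]


def inconsistent_row_ids(bucket_a: dict, bucket_b: dict):
    """Compare two buckets (row_id -> row hash) and return the row IDs that
    are missing on one side or whose row hashes differ."""
    all_ids = set(bucket_a) | set(bucket_b)
    return sorted(r for r in all_ids if bucket_a.get(r) != bucket_b.get(r))


# Toy leaf values and buckets; real values would be full hash digests.
leaves_a = [b"\x01", b"\x02", b"\x03"]
leaves_b = [b"\x01", b"\xff", b"\x03"]
bucket_a = {"r1": b"aa", "r2": b"bb", "r3": b"cc"}
bucket_b = {"r1": b"aa", "r2": b"XX", "r4": b"dd"}

print(differing_leaves(leaves_a, leaves_b))
print(inconsistent_row_ids(bucket_a, bucket_b))
```

Only the row identifiers returned by the bucket comparison then need to be looked up in the two databases, so the full row data is fetched for the inconsistent rows alone.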
  6. The method according to any one of claims 1-5, characterized in that the first data table comprises the full data of at least one data table in the first database.
  7. The method according to any one of claims 1-5, characterized in that the first data table comprises incremental data of at least one data table in the first database.
  8. The method according to any one of claims 1-7, characterized in that the height of the first Merkle tree is associated with the first data table.
  9. The method according to any one of claims 1-8, characterized in that the first database and the second database are heterogeneous databases or homogeneous databases.
  10. The method according to any one of claims 1-9, characterized in that the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.
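Claim 8 states only that the tree height is associated with the first data table. One hypothetical way to realize such an association, shown purely as an assumption, is to derive the number of buckets N from the row count and let the height follow from N. The target of 1000 rows per bucket, the rounding of N to a power of two, and both function names are illustrative choices, not taken from the source.

```python
import math


def bucket_count(row_count: int, rows_per_bucket: int = 1000) -> int:
    """Choose N >= 2 as a power of two so buckets hold about rows_per_bucket rows."""
    needed = max(2, math.ceil(row_count / rows_per_bucket))
    return 1 << (needed - 1).bit_length()  # round up to a power of two


def tree_height(n_leaves: int) -> int:
    """Number of levels (root and leaf levels included) of a complete binary
    tree whose leaf count is a power of two."""
    return n_leaves.bit_length()


for rows in (10, 50_000, 1_000_000):
    n = bucket_count(rows)
    print(rows, n, tree_height(n))
```

Under these assumptions a larger table yields more buckets and a taller tree, while the number of rows that must be compared inside any single mismatching bucket stays roughly constant.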
  11. A data verification apparatus, characterized in that the apparatus comprises:
    a processing unit, configured to process a first data table in a first database to generate a first Merkle tree, wherein each row of the first data table comprises a row identifier and row data, the first Merkle tree comprises N first leaf nodes, the N first leaf nodes are in one-to-one correspondence with N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to row identifiers, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
    the processing unit being further configured to process a second data table in a second database to generate a second Merkle tree, wherein the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree comprises N second leaf nodes, the N second leaf nodes are in one-to-one correspondence with N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to row identifiers, the hash rule used to partition the second data table into the N second hash buckets is the same as the hash rule used to partition the first data table into the N first hash buckets, and any two second hash buckets are different;
    a determining unit, configured to compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
  12. The apparatus according to claim 11, characterized in that the first data table comprises M rows, M being a positive integer greater than or equal to 1,
    the processing unit is further configured to:
    hash the M rows to obtain M first hash groups, wherein the M first hash groups are in one-to-one correspondence with the M rows, each first hash group comprises one row identifier of the M rows and a hash value of the row data corresponding to that row identifier, and the row identifiers comprised in the first hash groups are different from one another;
    map the M first hash groups to the N first hash buckets;
    the determining unit is further configured to:
    determine the hash values of the N first leaf nodes according to the N first hash buckets;
    the processing unit is further configured to:
    generate the first Merkle tree according to the hash values of the N first leaf nodes.
  13. The apparatus according to claim 12, characterized in that
    the hash value of each first leaf node is obtained by performing an XOR operation on the hash values comprised in the first hash groups comprised in the corresponding first hash bucket.
  14. The apparatus according to any one of claims 11-13, characterized in that the determining unit is further configured to:
    determine the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;
    if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determine that the first data table is consistent with the second data table;
    if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, determine that the first data table is inconsistent with the second data table.
  15. The apparatus according to claim 14, characterized in that
    the determining unit is further configured to:
    determine that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, wherein the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1≤i≤N;
    compare the i-th first hash bucket with the i-th second hash bucket to determine the row identifiers for which the first data table and the second data table are inconsistent;
    the processing unit is further configured to:
    query, according to the inconsistent row identifiers, the row data corresponding to the inconsistent row identifiers from the first database and the second database respectively.
  16. The apparatus according to any one of claims 11-15, characterized in that the first data table comprises the full data of at least one data table in the first database.
  17. The apparatus according to any one of claims 11-16, characterized in that the first data table comprises incremental data of at least one data table in the first database.
  18. The apparatus according to any one of claims 11-17, characterized in that the height of the first Merkle tree is associated with the first data table.
  19. The apparatus according to any one of claims 11-18, characterized in that the first database and the second database are heterogeneous databases or homogeneous databases.
  20. The apparatus according to any one of claims 11-19, characterized in that the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.
  21. A data verification apparatus, characterized by comprising at least one processor and a communication interface, wherein the at least one processor is configured to execute a computer program or instructions, so that data verification is performed according to the method of any one of claims 1 to 10.
  22. The data verification apparatus according to claim 21, characterized in that the apparatus further comprises at least one memory, the at least one memory is coupled to the at least one processor, and the computer program or instructions are stored in the at least one memory.
  23. A computer-readable storage medium, characterized by being configured to store computer instructions, wherein, when the computer instructions are executed, the method according to any one of claims 1 to 10 is implemented.
  24. A system, characterized by comprising the data verification apparatus according to claim 21 or 22.
PCT/CN2021/120282 2020-09-28 2021-09-24 Data verification method, apparatus, and system WO2022063223A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011040390.7 2020-09-28
CN202011040390.7A CN114281793A (en) 2020-09-28 2020-09-28 Data verification method, device and system

Publications (1)

Publication Number Publication Date
WO2022063223A1 (en) 2022-03-31

Family

ID=80846243

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120282 WO2022063223A1 (en) 2020-09-28 2021-09-24 Data verification method, apparatus, and system

Country Status (2)

Country Link
CN (1) CN114281793A (en)
WO (1) WO2022063223A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114912150A (en) * 2022-05-13 2022-08-16 北京航星永志科技有限公司 Data processing and acquiring method and device and electronic equipment
CN116860825A (en) * 2023-06-14 2023-10-10 北京科技大学 Verifiable retrieval method and system based on blockchain
CN117194390A (en) * 2023-11-08 2023-12-08 建信金融科技有限责任公司 Database migration method and device
CN117251460A (en) * 2023-08-10 2023-12-19 上海栈略数据技术有限公司 Data consistency check system for graph database and relational database

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115840753A (en) * 2022-09-23 2023-03-24 超聚变数字技术有限公司 Data verification method and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101894079A (en) * 2010-07-15 2010-11-24 哈尔滨工程大学 Hash tree memory integrity protection method of variable length storage block
EP3313020A1 (en) * 2016-10-24 2018-04-25 Aliasnet S.R.L. Method of digital identity generation and authentication
CN108427601A (en) * 2017-02-13 2018-08-21 北京航空航天大学 A kind of cluster transaction processing method of privately owned chain node
AU2019204764A1 (en) * 2018-07-03 2020-01-23 Servicenow, Inc. Multi-instance architecture supporting trusted blockchain-based network
CN110958109A (en) * 2019-10-12 2020-04-03 上海电力大学 Light dynamic data integrity auditing method based on hierarchical Mercker Hash tree
CN110989994A (en) * 2019-11-18 2020-04-10 腾讯科技(深圳)有限公司 Block chain-based code version management method and device, terminal and storage medium
CN111625258A (en) * 2020-05-22 2020-09-04 深圳前海微众银行股份有限公司 Mercker tree updating method, device, equipment and readable storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114912150A (en) * 2022-05-13 2022-08-16 北京航星永志科技有限公司 Data processing and acquiring method and device and electronic equipment
CN116860825A (en) * 2023-06-14 2023-10-10 北京科技大学 Verifiable retrieval method and system based on blockchain
CN116860825B (en) * 2023-06-14 2024-01-26 北京科技大学 Verifiable retrieval method and system based on blockchain
CN117251460A (en) * 2023-08-10 2023-12-19 上海栈略数据技术有限公司 Data consistency check system for graph database and relational database
CN117251460B (en) * 2023-08-10 2024-04-05 上海栈略数据技术有限公司 Data consistency check system for graph database and relational database
CN117194390A (en) * 2023-11-08 2023-12-08 建信金融科技有限责任公司 Database migration method and device
CN117194390B (en) * 2023-11-08 2024-02-09 建信金融科技有限责任公司 Database migration method and device

Also Published As

Publication number Publication date
CN114281793A (en) 2022-04-05

Similar Documents

Publication Publication Date Title
WO2022063223A1 (en) Data verification method, apparatus, and system
US10628449B2 (en) Method and apparatus for processing database data in distributed database system
US11693877B2 (en) Cross-ontology multi-master replication
CN106570086B (en) Data migration system and data migration method
US9658911B2 (en) Selecting a directory of a dispersed storage network
US9600513B2 (en) Database table comparison
WO2020233146A1 (en) Data operation record storage method, system and apparatus, and device
US10963481B2 (en) Custom object-in-memory format in data grid network appliance
CN107085570B (en) Data processing method, application server and router
US11176110B2 (en) Data updating method and device for a distributed database system
CN103176988A (en) Data migration system based on software-as-a-service (SaaS)
US11573961B2 (en) Delta graph traversing system
EP3499379B1 (en) Computer implemented and computer controlled method, computer program product and platform for manipulating data arranged for processing and storage at a data storage engine
US8407255B1 (en) Method and apparatus for exploiting master-detail data relationships to enhance searching operations
CN112912870A (en) Tenant identifier conversion
CN111290714B (en) Data reading method and device
US11620311B1 (en) Transformation of directed graph into relational data
CN111835871A (en) Method and device for transmitting data file and method and device for receiving data file
US20110191549A1 (en) Data Array Manipulation
CN116303789A (en) Parallel synchronization method and device for multi-fragment multi-copy database and readable medium
Kantabutra et al. Intentionally-Linked Entities: A Better Database System for Representing Dynamic Social Networks, Narrative Geographic Information Sytem and General Abstractions of Reality
WO2020087955A1 (en) Method, apparatus and system for processing hard disk identifier duplication
US10771095B2 (en) Data processing device, data processing method, and computer readable medium
US11971891B1 (en) Accessing siloed data across disparate locations via a unified metadata graph systems and methods
CN110807119B (en) Face duplicate checking method and device

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21871592; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21871592; Country of ref document: EP; Kind code of ref document: A1)
Kind code of ref document: A1