CN114281793A - Data verification method, device and system - Google Patents

Data verification method, device and system


Publication number
CN114281793A
Authority
CN
China
Prior art keywords
hash
data
data table
database
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011040390.7A
Other languages
Chinese (zh)
Inventor
黄凯耀
郑云洲
孟小珍
李龙
赵俊
李志学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202011040390.7A (CN114281793A)
Priority to PCT/CN2021/120282 (WO2022063223A1)
Publication of CN114281793A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data verification method, device and system. The method comprises: generating a first Merkle tree according to a first data table in a first database, and generating a second Merkle tree according to a second data table in a second database, wherein the first Merkle tree and the second Merkle tree have the same structure and are generated by the same method, so that whether the first data table and the second data table are consistent can be determined by directly comparing the hash values of the root nodes of the two Merkle trees. Therefore, the data verification method provided by the application can meet the requirement of accurately and quickly verifying massive data.

Description

Data verification method, device and system
Technical Field
The present application relates to the field of storage, and more particularly, to a data verification method, apparatus and system.
Background
During data synchronization or data migration in a database system (e.g., between heterogeneous databases or homogeneous databases), the data consistency of the synchronized tables in the source database and the target database needs to be checked to verify that the synchronization or migration is correct. In practice, the data in the source database is often inconsistent with the data in the target database. On one hand, during data transmission and data storage, hardware faults, software defects, human errors, environmental interference and other factors cause data loss and data errors, so that the data in the source database is inconsistent with the data in the target database. On the other hand, because of the performance limits of the database system, a change to a source data table may reach the target database only after a certain delay, so that the data table to be checked in the source database and the one in the target database are inconsistent at a given moment.
The traditional data verification method cannot meet the requirement of accurately and quickly verifying mass data.
Disclosure of Invention
The application provides a data verification method, a device and a system, which can be applied to a data synchronization or data migration scene, and the method can better meet the requirement of accurate and rapid verification on mass data.
In a first aspect, a data verification method is provided, where the method includes:
processing a first data table in a first database to generate a first Merkle tree, wherein each row of the first data table comprises a row identifier and row data, the first Merkle tree comprises N first leaf nodes, the N first leaf nodes correspond one-to-one to N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
processing a second data table in a second database to generate a second Merkle tree, wherein the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree comprises N second leaf nodes, the N second leaf nodes correspond one-to-one to N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to the row identifier, the hash rule used to partition the second data table into the N second hash buckets is the same as the hash rule used to partition the first data table into the N first hash buckets, and any two second hash buckets are different;
comparing the first Merkle tree with the second Merkle tree to determine whether the first data table and the second data table are consistent.
In the above technical solution, the hash value of each first leaf node of the first Merkle tree corresponding to the first data table is determined according to the corresponding first hash bucket, and the hash value of each second leaf node of the second Merkle tree corresponding to the second data table is determined according to the corresponding second hash bucket. Since the hash rule used to partition the second data table into the N second hash buckets is the same as the hash rule used to partition the first data table into the N first hash buckets, data with the same row identifier in the second data table and the first data table is mapped to the second hash bucket and the first hash bucket with the same sequence number, so that the generated first Merkle tree and second Merkle tree have the same structure. Therefore, whether the first data table and the second data table are consistent can be determined directly by comparing the hash values of the root nodes of the two Merkle trees. The data verification method provided by the application can thus meet the requirement of accurately and quickly verifying massive data.
With reference to the first aspect, in certain implementations of the first aspect, the first data table includes M rows, where M is a positive integer greater than or equal to 1, and the processing the first data table in the first database to generate the first Merkle tree includes:
performing hash processing on the first data table to obtain M first hash groups, wherein the M first hash groups correspond to the M rows one by one, each first hash group comprises a row identifier in the first data table and a hash value of row data corresponding to the row identifier, and the row identifiers of each first hash group are different;
mapping the M first hash groups to the N first hash buckets;
determining hash values of the N first leaf nodes according to the N first hash buckets;
and generating the first Merkle tree according to the hash values of the N first leaf nodes.
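The first two steps above (building hash groups and mapping them to hash buckets) can be sketched in Python as follows. This is a minimal illustration, not the method of the application: SHA-256 as the row hash, the rule "SHA-256 of the row identifier modulo N" as the partitioning rule, and the names `row_hash_group`, `bucket_index` and `partition` are all assumptions, since the text does not name a hash function or a mapping rule.

```python
import hashlib

def row_hash_group(row_id: str, row_data: str) -> tuple[str, bytes]:
    """Return a hash group: the row identifier plus the hash of the row data."""
    return row_id, hashlib.sha256(row_data.encode("utf-8")).digest()

def bucket_index(row_id: str, n_buckets: int) -> int:
    """Map a row identifier to a bucket (assumed rule: SHA-256 modulo N)."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_buckets

def partition(rows: dict[str, str], n_buckets: int) -> list[dict[str, bytes]]:
    """Hash every row, then place each hash group in its bucket."""
    buckets: list[dict[str, bytes]] = [{} for _ in range(n_buckets)]
    for row_id, row_data in rows.items():
        rid, row_hash = row_hash_group(row_id, row_data)
        buckets[bucket_index(rid, n_buckets)][rid] = row_hash
    return buckets

# Hypothetical tables: the same rows, stored in a different order.
source = {"1": "alice,30", "2": "bob,41", "3": "carol,27"}
target = {"3": "carol,27", "1": "alice,30", "2": "bob,41"}
source_buckets = partition(source, 4)
target_buckets = partition(target, 4)
```

Because the bucket index depends only on the row identifier and N, a given row lands in the bucket with the same sequence number on both sides, regardless of the order in which the rows are read.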
It should be understood that the second Merkle tree is generated by processing the second data table in the second database in the same way as described above. Specifically, when the second data table includes K rows, where K is a positive integer less than or equal to M, the processing of the second data table in the second database to generate the second Merkle tree includes:
performing hash processing on the second data table to obtain K second hash groups, wherein the K second hash groups correspond to the K rows one by one, each second hash group comprises a row identifier in the second data table and a hash value of row data corresponding to the row identifier, and the row identifiers of each second hash group are different;
mapping the K second hash groups to the N second hash buckets;
determining hash values of the N second leaf nodes according to the N second hash buckets;
and generating the second Merkle tree according to the hash values of the N second leaf nodes.
When K is equal to M, the second data table includes the same number of rows as the first data table, that is, no data was lost while the first data table was synchronized or migrated to the second database. When K is less than M, the second data table includes fewer rows than the first data table, that is, data was lost while the first data table was synchronized or migrated to the second database.
It is also understood that the mapping rules that map the K second hash groups to the N second hash buckets are the same as the mapping rules that map the M first hash groups to the N first hash buckets.
In the above technical solution, a data partitioning algorithm is adopted to map the row data in a data table to hash buckets, and the hash buckets correspond one-to-one to the leaf nodes of the Merkle tree. Since the mapping rule for mapping the K second hash groups to the N second hash buckets is the same as the mapping rule for mapping the M first hash groups to the N first hash buckets, the generated first Merkle tree and second Merkle tree have the same structure.
With reference to the first aspect, in certain implementations of the first aspect, the hash value of each first leaf node is obtained by performing an exclusive-OR (XOR) operation on the hash values contained in the first hash groups in the corresponding first hash bucket.
It should be understood that the hash value of each second leaf node is obtained by performing an XOR operation on the hash values contained in the second hash groups in the corresponding second hash bucket.
In the above technical solution, the hash value of a leaf node is obtained by performing an XOR operation on the data in its hash bucket, so the data does not need to be sorted when the first data table and the second data table are checked for consistency, which further improves the efficiency of data verification.
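The reason no sorting is needed is that XOR is commutative and associative, so the leaf hash does not depend on the order in which row hashes arrive. A small Python sketch, assuming 32-byte SHA-256 row hashes and hypothetical helper names `xor_bytes` and `leaf_hash`:

```python
import hashlib
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def leaf_hash(row_hashes: list[bytes]) -> bytes:
    """XOR all row hashes in a bucket; an empty bucket yields 32 zero bytes."""
    return reduce(xor_bytes, row_hashes, bytes(32))

# Hypothetical row data hashed with SHA-256 (an assumption of this sketch).
hashes = [hashlib.sha256(s.encode("utf-8")).digest()
          for s in ("alice,30", "bob,41", "carol,27")]
forward = leaf_hash(hashes)
backward = leaf_hash(list(reversed(hashes)))
```

One property of plain XOR worth keeping in mind when reading this sketch: two equal hash values cancel each other, so it assumes that the row hashes within a bucket are distinct.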
With reference to the first aspect, in certain implementations of the first aspect, the comparing the first Merkle tree with the second Merkle tree to determine whether the first data table and the second data table are consistent includes:
determining the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;
if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determining that the first data table is consistent with the second data table;
and if the hash value of the root node of the first Merkle tree is not the same as the hash value of the root node of the second Merkle tree, determining that the first data table is inconsistent with the second data table.
In the above technical solution, since the structures of the first Merkle tree and the second Merkle tree (for example, the tree height and the row identifiers in the data table that correspond to each leaf node) are completely the same, whether the first data table and the second data table are consistent can be determined accurately and quickly by checking whether the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree. Specifically, when the two root hash values are the same, it may be determined that the first data table is consistent with the second data table. When the two root hash values are not the same, it may be determined that the first data table is inconsistent with the second data table.
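The root comparison can be sketched end to end as follows. The sketch assumes SHA-256, a modulo-N partitioning rule, XOR leaf hashes and a leaf count that is a power of two; the function names are hypothetical and none of these choices is stated in the text as the only option.

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def leaf_hashes(rows: dict[str, str], n: int) -> list[bytes]:
    """Bucket rows by row identifier, then XOR the row hashes in each bucket."""
    leaves = [bytes(32)] * n
    for row_id, row_data in rows.items():
        i = int.from_bytes(sha(row_id.encode("utf-8")), "big") % n
        row_hash = sha(row_data.encode("utf-8"))
        leaves[i] = bytes(x ^ y for x, y in zip(leaves[i], row_hash))
    return leaves

def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash pairs of nodes level by level; len(leaves) is assumed a power of two."""
    level = leaves
    while len(level) > 1:
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def tables_consistent(first: dict[str, str], second: dict[str, str], n: int = 4) -> bool:
    """Compare only the two root hashes."""
    return merkle_root(leaf_hashes(first, n)) == merkle_root(leaf_hashes(second, n))

# Hypothetical tables keyed by row identifier.
first = {"1": "alice,30", "2": "bob,41", "3": "carol,27"}
reordered = {"3": "carol,27", "2": "bob,41", "1": "alice,30"}
changed = {"1": "alice,30", "2": "bob,42", "3": "carol,27"}
missing = {"1": "alice,30", "2": "bob,41"}
```

Here `tables_consistent(first, reordered)` holds, while a changed row or a missing row alters one leaf hash and therefore the root hash.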
With reference to the first aspect, in certain implementations of the first aspect, after determining that the first data table is inconsistent with the second data table, the method further includes:
determining that the hash value of an ith first leaf node is different from the hash value of an ith second leaf node, wherein the hash value of the ith first leaf node is determined according to the ith first hash bucket, the hash value of the ith second leaf node is determined according to the ith second hash bucket, i is a positive integer, and i is more than or equal to 1 and less than or equal to N;
comparing the ith first hash bucket with the ith second hash bucket, and determining the inconsistent row identifier of the first data table and the second data table;
and respectively inquiring the row data corresponding to the inconsistent row identification from the first database and the second database according to the inconsistent row identification.
In the above technical solution, after it is determined that the first data table and the second data table are inconsistent, the inconsistent row identifiers in the first database and the second database can be determined from the hash values of the leaf nodes and the hash buckets corresponding to those leaf nodes. Further, the row data corresponding to the inconsistent row identifiers can be queried from the first database and the second database according to the determined inconsistent row identifiers. Since the size of the data set behind each leaf node is controllable, the time required to determine the inconsistent row identifiers is also controllable.
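The localization steps can be sketched as follows: find the leaf positions whose hashes differ, then compare only the two hash buckets at those positions. The bucket contents, the SHA-256 row hash and the function names are assumptions made for illustration, and the sketch counts leaf positions from 0 whereas the text counts i from 1.

```python
import hashlib

def sha(text: str) -> bytes:
    return hashlib.sha256(text.encode("utf-8")).digest()

def xor_all(hashes) -> bytes:
    """Leaf hash of a bucket: XOR of its row hashes."""
    out = bytes(32)
    for h in hashes:
        out = bytes(x ^ y for x, y in zip(out, h))
    return out

def differing_leaves(first_buckets, second_buckets) -> list[int]:
    """Positions whose leaf hashes differ between the two trees."""
    return [
        i for i, (a, b) in enumerate(zip(first_buckets, second_buckets))
        if xor_all(a.values()) != xor_all(b.values())
    ]

def inconsistent_row_ids(first_bucket, second_bucket) -> list[str]:
    """Row identifiers missing on one side or whose row hashes differ."""
    ids = first_bucket.keys() | second_bucket.keys()
    return sorted(r for r in ids if first_bucket.get(r) != second_bucket.get(r))

# Hypothetical buckets, already partitioned by the same rule on both sides.
first_buckets = [{"1": sha("alice,30")},
                 {"2": sha("bob,41"), "3": sha("carol,27")}]
second_buckets = [{"1": sha("alice,30")},
                  {"2": sha("bob,42")}]  # row 2 changed, row 3 lost

suspects = {
    i: inconsistent_row_ids(first_buckets[i], second_buckets[i])
    for i in differing_leaves(first_buckets, second_buckets)
}
```

Only the bucket at position 1 is inspected, and it yields row identifiers "2" and "3". The last step, querying both databases for the row data of these identifiers, is omitted because it depends on the database interface.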
With reference to the first aspect, in certain implementations of the first aspect, the first data table includes a full amount of data in at least one data table in the first database.
With reference to the first aspect, in certain implementations of the first aspect, the first data table includes incremental data in at least one data table in the first database.
In the above technical solution, the full data and the incremental data can be separated and verified in two stages, which saves computation cost. Specifically, because the full data includes a large amount of data, a Merkle tree with more levels can be constructed when the full data is verified. Because the incremental data includes a small amount of data, a Merkle tree with fewer levels can be constructed when the incremental data is verified.
With reference to the first aspect, in certain implementations of the first aspect, a height of the first Merkle tree is associated with the first data table.
In the above technical solution, the height of the first Merkle tree can be adaptively adjusted according to the size of the first data table to be checked.
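One possible way to adapt the height is to fix a target number of rows per hash bucket and pick the smallest binary tree with enough leaves. The rule below, the default of 1000 rows per bucket and the names `tree_height` and `leaf_count` are purely illustrative assumptions; the text does not specify how the height is derived.

```python
import math

def tree_height(row_count: int, rows_per_bucket: int = 1000) -> int:
    """Levels below the root of a binary tree with enough leaves (at least 2)."""
    buckets_needed = max(2, math.ceil(row_count / rows_per_bucket))
    # Smallest h with 2**h >= buckets_needed, computed with integers only.
    return (buckets_needed - 1).bit_length()

def leaf_count(row_count: int, rows_per_bucket: int = 1000) -> int:
    """Number of leaf nodes N, always a power of two in this sketch."""
    return 2 ** tree_height(row_count, rows_per_bucket)

incremental_height = tree_height(500)        # small incremental batch
full_height = tree_height(5_000_000)         # large full table
```

Under these assumptions a 500-row incremental batch gets a tree with 2 leaves, while a 5,000,000-row full table gets 8192 leaves, which keeps the expected bucket size bounded in both cases.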
With reference to the first aspect, in certain implementations of the first aspect, the first database and the second database are heterogeneous databases or homogeneous databases.
In the technical scheme, the data verification method provided by the application can be suitable for data consistency verification of homogeneous databases and data consistency verification of heterogeneous databases.
With reference to the first aspect, in certain implementations of the first aspect, the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.
In a second aspect, a data checking apparatus is provided, which performs the method in the first aspect and any possible implementation manner of the first aspect.
It should be understood that the data verification device provided by the present application is independent of and decoupled from the database system, so the data verification device is not intrusive to the database system. For example, it does not affect the function or performance of the database system or occupy the resources of the database system.
In a third aspect, a data verification device is provided, where the device includes a memory and a processor, where the memory is configured to store instructions, and the processor is configured to read the instructions stored in the memory, so that the data verification device executes the method in the first aspect and any possible implementation manner of the first aspect.
In a fourth aspect, a processor is provided, comprising an input circuit, an output circuit and a processing circuit. The processing circuit is configured to receive signals via the input circuit and to transmit signals via the output circuit, such that the method in the first aspect and any possible implementation manner of the first aspect is implemented.
In a specific implementation process, the processor may be a chip, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a flip-flop, various logic circuits, and the like. The input signal received by the input circuit may be received and input by, for example and without limitation, a receiver, the signal output by the output circuit may be output to and transmitted by a transmitter, for example and without limitation, and the input circuit and the output circuit may be the same circuit that functions as the input circuit and the output circuit, respectively, at different times. The embodiment of the present application does not limit the specific implementation manner of the processor and various circuits.
In a fifth aspect, a processing apparatus is provided that includes a processor and a memory. The processor is configured to read instructions stored in the memory, and may receive a signal via the receiver and transmit a signal via the transmitter to perform the method of the first aspect and any possible implementation manner of the first aspect.
Optionally, the number of the processors is one or more, and the number of the memories is one or more.
Alternatively, the memory may be integral to the processor or provided separately from the processor.
In a specific implementation process, the memory may be a non-transitory memory, such as a Read Only Memory (ROM), which may be integrated on the same chip as the processor, or may be separately disposed on different chips.
It will be appreciated that, in the associated data interaction process, sending the indication information, for example, may be a process in which the processor outputs the indication information, and receiving the capability information may be a process in which the processor receives the input capability information. In particular, the data output by the processor may be output to a transmitter, and the input data received by the processor may come from a receiver. The transmitter and the receiver may be collectively referred to as a transceiver.
A sixth aspect provides a computer-readable storage medium for storing a computer program comprising instructions for performing the method of the first aspect above and any possible implementation manner of the first aspect above.
In a seventh aspect, a computer program product is provided that comprises instructions, which when run on a computer, cause the computer to perform the method of the first aspect and any possible implementation manner of the first aspect.
In an eighth aspect, a system is provided, which comprises the data verification device of the second aspect.
In a ninth aspect, a chip is provided, comprising at least one processor and an interface; the at least one processor is configured to invoke and run a computer program, so that the chip executes the method in the first aspect and any possible implementation manner of the first aspect.
Drawings
FIG. 1 is a schematic diagram of a system 100 suitable for use with the data verification methods provided herein.
Fig. 2 is a schematic diagram of the data verification apparatus 130 provided in the present application.
Fig. 3 is a schematic flow chart diagram of a data verification method 100 provided herein.
FIG. 4 is a schematic illustration of a Merkle tree determined according to the method provided herein.
FIG. 5 is a schematic diagram of data extraction from a first data table provided in the present application.
Fig. 6 is a schematic diagram of hash partitioning of data extracted from a first data table according to the present application.
FIG. 7 is a schematic illustration of a Merkle tree determined according to the method provided herein.
Fig. 8 is a schematic flow chart diagram of a data verification method 200 provided herein.
FIG. 9 is a schematic illustration of a Merkle tree determined according to the method provided herein.
Fig. 10 is a schematic structural diagram of a data verification apparatus 1000 according to the present application.
Fig. 11 is a schematic structural diagram of a data verification apparatus 1000 provided in the present application.
Fig. 12 is a schematic structural diagram of a system 1200 provided in the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
The terminology used in the description of the embodiments section of the present application is for the purpose of describing particular embodiments of the present application only and is not intended to be limiting of the present application.
The terms "first," "second," "third," and the like in this application are used to distinguish between identical or similar items that have substantially the same function, and it is to be understood that "first," "second," and "third" imply no logical or temporal dependency and no limitation on number or order of execution.
This application is intended to present various aspects, embodiments or features around a system that may include a number of devices, components, modules, and the like. It is to be understood and appreciated that the various systems may include additional devices, components, modules, etc. and/or may not include all of the devices, components, modules etc. discussed in connection with the figures. Furthermore, a combination of these schemes may also be used.
In addition, in the embodiments of the present application, words such as "exemplary", "for example", etc. are used to mean serving as examples, illustrations or explanations. Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the term using examples is intended to present concepts in a concrete fashion.
In the embodiments of the present application, "corresponding" and "relevant" are sometimes used interchangeably; the intended meaning is the same when the difference is not emphasized.
In the embodiments of the present application, a subscripted form such as W₁ may sometimes be written in a non-subscripted form such as W1; the intended meaning is the same when the distinction is not emphasized.
The network architecture and the service scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not form a limitation on the technical solution provided in the embodiment of the present application, and as a person of ordinary skill in the art knows that along with the evolution of the network architecture and the appearance of a new service scenario, the technical solution provided in the embodiment of the present application is also applicable to similar technical problems.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.
The related art of the present application is described next.
For ease of understanding, before the data verification method provided herein is described, the relevant terms used in this application are briefly introduced.
1. Data verification (data verify)
Data verification is a verification operation performed to ensure the integrity of data. Usually, a check value is calculated for the original data by a specified algorithm, and the receiver calculates a check value by the same algorithm, and if the check values obtained by the two calculations are the same, the data are consistent.
2. Data replication (data replication)
Data replication, a technique for replicating data from one location to another, involves sharing information to ensure consistency between redundant resources (e.g., software or hardware components) to improve reliability, fault tolerance, or accessibility.
3. Merkle tree (merkle tree)
The Merkle tree may also be referred to as a hash tree. A Merkle tree is a binary tree consisting of a root node, a set of intermediate nodes, and a set of leaf nodes. Each lowest-level leaf node contains stored data or its hash value, each intermediate node holds the hash value of the contents of its two child nodes, and the root node likewise holds the hash value of the contents of its two child nodes. When the data extraction module generates new data, the corresponding Merkle trees are dynamically updated.
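The text does not say how the dynamic update is carried out. One common approach, shown here only as a sketch under that assumption, is to replace the affected leaf and recompute just the hashes on its path to the root. The array layout (node 1 is the root, node k has children 2k and 2k+1), SHA-256 and the function names are choices made for this illustration.

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaves: list[bytes]) -> list[bytes]:
    """Store the tree in an array; index 0 is unused, leaves fill the second half."""
    n = len(leaves)  # assumed to be a power of two
    nodes = [b""] * n + list(leaves)
    for k in range(n - 1, 0, -1):
        nodes[k] = sha(nodes[2 * k] + nodes[2 * k + 1])
    return nodes

def update_leaf(nodes: list[bytes], index: int, new_hash: bytes) -> None:
    """Replace one leaf and recompute only the hashes on its path to the root."""
    k = len(nodes) // 2 + index
    nodes[k] = new_hash
    k //= 2
    while k >= 1:
        nodes[k] = sha(nodes[2 * k] + nodes[2 * k + 1])
        k //= 2

# Hypothetical leaf hashes for a tree with four leaves.
leaves = [sha(bytes([i])) for i in range(4)]
tree = build(leaves)
old_root = tree[1]
update_leaf(tree, 2, sha(b"changed"))
rebuilt = build(leaves[:2] + [sha(b"changed")] + leaves[3:])
```

After the update the root differs from `old_root`, and the updated array equals a tree rebuilt from scratch with the new leaf, while only two internal hashes were recomputed.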
4. Hash (hash)
A hash is a function that maps data of arbitrary length to data of fixed length. A slight change in the input data produces a completely different hash result, and it is generally considered infeasible to infer the original input data from the hash value.
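Both properties can be observed with a standard hash function. The sketch below uses SHA-256 from Python's `hashlib` as an example; the text itself does not prescribe a particular hash function.

```python
import hashlib

# Inputs of very different lengths produce digests of the same length.
short_digest = hashlib.sha256(b"a").hexdigest()
long_digest = hashlib.sha256(b"a" * 10_000).hexdigest()

# Changing a single character of the input changes the digest.
near_digest = hashlib.sha256(b"b").hexdigest()
differing_positions = sum(x != y for x, y in zip(short_digest, near_digest))
```

`differing_positions` counts how many of the 64 hexadecimal characters differ between the digests of the two one-character inputs.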
5. Heterogeneous Database (HDB)
A heterogeneous database is a collection of related database systems that provides shared and transparent access to data. Each database system already exists before it is added to the heterogeneous database system and has its own database management system (DBMS). Each component of the heterogeneous database is autonomous: while sharing data, each database system retains its own application characteristics, integrity control and security control.
6. Homogeneous database
A homogeneous database means that all sites use the same DBMS software, and all sites are aware of each other and cooperate to process user requests.
7. Relational Database (RD)
RD refers to a database that uses a relational model to organize data and stores data in rows and columns for easy understanding by users. A series of rows and columns in a relational database is called a table, and a group of tables constitutes the database. A user retrieves data from the database with a query, which is executable code that selects certain areas of the database. The relational model can be simply understood as a two-dimensional table model, and a relational database is a data organization composed of two-dimensional tables and the relations between them.
8. Non-relational database (not only sql, NoSQL)
NoSQL refers to a database that employs a non-relational model to organize data. Non-relational databases may include several types: key-value store databases (e.g., Oracle BDB), column store databases (e.g., HBase), document-type databases (e.g., CouchDB or mongoddb), and graph databases, among others.
Generally, when data is synchronized between a source database and a target database, the data to be synchronized is transmitted over a physical medium or a network, so there is a certain delay between the source database and the target database. In addition, many uncertain factors can affect the data during transmission, such as hardware failures, software defects, human error, and environmental interference, and these may compromise the reliability of the data to be synchronized (for example, through data loss or data errors). For these reasons, the source database and the target database may become inconsistent. Therefore, after data synchronization is completed, a consistency check needs to be performed on the synchronized data stored in the target database to ensure its reliability. Currently, an offline data verification method is generally adopted: data generated by running application software is acquired from the production database (i.e., the source database) of each independent data source to obtain a unified offline database (i.e., the target database), and the offline data is then used to check whether the data of the independent data sources is consistent. However, acquiring data from the production database occupies database resources. To reduce the impact on database performance, the acquisition frequency has to be reduced and data can be acquired only during periods of low traffic, which impairs the timeliness of offline data verification. In addition, the data generally needs to be sorted during verification, and sorting usually consumes a large amount of system resources.
Therefore, the offline data verification method cannot meet service requirements when massive data (for example, TB-level data) is verified.
The application provides a data verification method, device and system, which can better meet the requirements of accurate and rapid verification of mass data.
For ease of understanding, first, with reference to fig. 1 and fig. 2, a system and a data verification apparatus suitable for the data verification method provided in the present application will be described in detail.
FIG. 1 is a schematic diagram of a system 100 suitable for use with the data verification methods provided herein.
As shown in fig. 1, the system 100 can be used in, but is not limited to, the following scenarios: a data migration scenario or a data synchronization scenario for the database. The system 100 may include at least one source-side database 110, at least one target-side database 120, and at least one data checking device 130. The data checking device 130 is a system on a third-party hardware device independent of the source-side database 110 and the target-side database 120, where the source-side database 110 is a database before data migration or replication, and the target-side database 120 is a database after data migration or replication.
In this application, the type of the source database 110 and the type of the target database 120 are not particularly limited.
In one example, the source database 110 or the target database 120 may be a relational database. For example, the source database 110 or the target database 120 may be any one of the following relational databases: Oracle, DB2, Microsoft SQL Server, Microsoft Access, MySQL. It should be understood that these types of relational database are illustrative only and do not limit the system 100. For example, the relational database may also be a type of relational database other than those listed above.
In another example, the source database 110 or the target database 120 may be a non-relational database. For example, the source database 110 or the target database 120 may be any one of the following non-relational databases: NoSQL, Cloudant, and MongoDB. It should be understood that these types of non-relational database are illustrative only and do not limit the system 100. For example, the non-relational database may also be a type of non-relational database other than those listed above.
In yet another example, the source database 110 may be a non-relational database, and the target database 120 may be a relational database. For example, the source database 110 may be a NoSQL database, and the target database 120 may be an Oracle database.
In the present application, the source database 110 and the target database 120 may be homogeneous databases. The source database 110 and the target database 120 may also be heterogeneous databases, which is not limited in this respect.
In this application, the deployment of the source database 110, the target database 120, and the data checking device 130 in a device is not specifically limited, but it is required to ensure that the data checking device 130 is a system independent from the source database 110 and the target database 120.
In an example, the source database 110 may be a physical module or a virtual module deployed on the physical device #1, the target database 120 may be a physical module or a virtual module deployed on the physical device #2, the data checking device 130 may be a physical module or a virtual module deployed on the physical device #3, and the physical device #1, the physical device #2, and the physical device #3 are different devices.
In another example, the source database 110 and the target database 120 may be different physical modules or virtual modules deployed on the physical device #1, the data checking device 130 may be a physical module or virtual module deployed on the physical device #2, and the physical device #1 and the physical device #2 are different devices.
Referring to fig. 1, the source database 110 and the target database 120 may interact with each other (e.g., data migration or data synchronization, etc.), and the source database 110 and the target database 120 may also interact with the data checking apparatus 130, respectively. After the data in the source database 110 is synchronized or migrated to the target database 120, the data verification device 130 may extract the data to be verified from the source database 110 and the data to be verified from the target database 120, and perform consistency verification on the two extracted data portions, so as to verify whether the data after data migration or data synchronization has consistency between the source database 110 and the target database 120. When the data verification device 130 determines that the data to be verified extracted from the source database does not have consistency with the data to be verified extracted from the target database, it may further determine which data is inconsistent. When the data verification apparatus 130 has a storage function, the result of the consistency verification may also be stored in the data verification apparatus 130.
It should be understood that fig. 1 is merely illustrative and is not intended to limit the systems to which the present application is applicable in any way. For example, a greater number of source databases 110 and/or target databases 120 and/or data checking devices 130 may be included in the system 100. For example, the data verification device 130 may further include other modules, such as a verification execution module, a source to-be-verified data management module, a target to-be-verified data management module, and the like.
Next, referring to fig. 2, a schematic structural diagram of the data verification apparatus 130 in fig. 1 provided in the present application is described.
Fig. 2 is a schematic diagram of the data verification apparatus 130 provided in the present application.
As shown in fig. 2, the apparatus 130 may include: a source data extraction module 131, a source processing module 132, a target processing module 133, a target data extraction module 134, a comparison module 135, and a storage module 136. The modules may be connected by internal connection paths. For example, the source processing module 132 may interact with the comparison module 135, the source data extraction module 131, and the target processing module 133.
The source data extraction module 131 is configured to obtain data from a source database (e.g., the source database 110). For example, the source data extraction module 131 may obtain data from the source database 110 in fig. 1.
The source processing module 132 is configured to obtain data from the source data extraction module 131, and perform hash processing, data partitioning processing, and the like on the obtained data.
The target processing module 133 is configured to obtain data from the target data extraction module 134, and perform hash processing, data partitioning processing, and the like on the obtained data.
The target data extraction module 134 is configured to obtain data from a target database (e.g., the target database 120). For example, the target data extraction module 134 may obtain data from the target database 120 in fig. 1.
The comparison module 135 is configured to obtain the Merkle trees corresponding to the data from the source processing module 132 and the target processing module 133, and to perform data consistency verification based on the obtained Merkle trees. The comparison module 135 is the core module of the data verification device 130. Specifically, the comparison module 135 may further include a data comparison sub-module and a data back-check sub-module. The data comparison sub-module can use the Merkle trees to quickly find the set of row identifiers of inconsistent rows, and the data back-check sub-module can then query the database for the detailed data by those row identifiers, finally obtaining the data values corresponding to the inconsistent row identifiers.
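How a comparison over two Merkle trees can narrow down inconsistent data may be sketched as follows. This is only an illustrative top-down walk and not necessarily the procedure used by the comparison module 135; the function name differing_leaves and the representation of a tree as a list of levels (leaves first, root last) are assumptions made for this example.

```python
def differing_leaves(levels_a, levels_b):
    """Return the indices of leaf nodes whose hash values differ between two trees.

    Each tree is given as a list of levels: levels[0] holds the leaf hashes and
    levels[-1] holds the single root hash. Both trees must have the same shape.
    """
    top = len(levels_a) - 1
    # Equal roots mean the trees match, so nothing below needs to be visited.
    suspects = [0] if levels_a[top][0] != levels_b[top][0] else []
    # Descend one level at a time, following only the nodes that differ.
    for depth in range(top - 1, -1, -1):
        next_suspects = []
        for index in suspects:
            for child in (2 * index, 2 * index + 1):
                if levels_a[depth][child] != levels_b[depth][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects

# Hand-made node values standing in for real hash values.
source_tree = [[1, 2, 3, 4], [10, 20], [100]]
target_tree = [[1, 2, 9, 4], [10, 21], [101]]
mismatched = differing_leaves(source_tree, target_tree)
```

Only the subtrees whose hashes differ are visited, so the rows that need to be looked up again are limited to those in the hash buckets of the returned leaf nodes.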
The storage module 136 is configured to store data and instructions.
It should be understood that fig. 2 is only an illustration and does not constitute any limitation to the data verification apparatus 130 provided in the present application. For example, the source processing module 132 and the target processing module 133 in the data verification apparatus 130 may also be included in the same processing module. For example, the source data extraction module 131 and the target data extraction module 134 in the data verification apparatus 130 may also be included in the same processing module. For example, when the comparison module 135 in the data verification apparatus 130 has the function of the storage module 136, the data verification apparatus 130 may not include the storage module 136.
The data verification method provided by the present application is described in detail below with reference to fig. 3 to 8.
Fig. 3 is a schematic flow chart diagram of a data verification method 100 provided herein.
As shown in fig. 3, method 100 may include steps 110 through 130, which are described in detail below. Steps 110 to 130 may be executed by the data verification apparatus 130 shown in fig. 2.
Step 110, processing the first data table in the first database to generate a first Merkle tree.
The first database may be understood as a production database, i.e., a source database. For example, in one example, the first database may be the source database 110 shown in FIG. 1.
In the present application, the data source and the data size included in the first data table are not limited.
In one example, the first data table may include a full amount of data in at least one data table in the first database.
Optionally, the first data table may further include two or even more data tables in the first database. In this case, the first data table may be understood as a data set composed of two or more data tables in the first database.
For example, when the data table #1, the data table #2, and the data table #3 are included in the first database, the first data table may include all data in the data table #1. The first data table may also include all the data in data table #1 and data table #3. The first data table may further include all data in data table #1, data table #2, and data table #3.
In another example, the first data table may include incremental data in at least one data table in the first database. That is to say, the data verification method provided by the present application may also perform consistency verification only on data that changes in the data table.
For example, at time #1, data table #1 and data table #2 are consistent, where data table #2 is obtained by copying data table #1. After time #1, part of the data in data table #1 changes (e.g., data is updated, added, or removed). In this case, the changed data in data table #1 may be considered to be the data included in the first data table.
In this application, each row of the first data table may include a row identifier and row data. The first Merkle tree includes N first leaf nodes, and the N first leaf nodes correspond one to one to N first hash buckets. The hash value of each first leaf node is determined according to the corresponding first hash bucket. The N first hash buckets are obtained by performing hash partitioning on the first data table according to the row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2.
In the present application, the type of the row id included in the first data table is not particularly limited.
In one example, the row identification may be a numeric row identification. For example, the numeric row identification may be "5".
In another example, the row identifier may be a string-type row identifier. For example, the string-type row identifier may be "Zhang San" or "Li Si", or the like.
When the row identifier is a string-type row identifier, the string-type row identifier needs to be processed before the Merkle tree is constructed, to obtain a hash value corresponding to the string-type row identifier.
In this application, N first hash buckets are obtained by performing hash partitioning on the first data table according to the row identifier, and it can be understood that, no matter whether the row identifier of the first data table is a numerical row identifier or a string row identifier, when mapping data in the first data table to the hash buckets, the row identifier included in the first data table may be subjected to hash operation to obtain hash values corresponding to the row identifier, and then the hash values corresponding to the row identifier are subjected to modulo operation, and then the hash bucket corresponding to the row where the row identifier is located is determined according to the result of the modulo operation.
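The mapping from a row identifier to a hash bucket described above can be sketched as follows. The hash function (SHA-256 truncated to 8 bytes), the function name bucket_index, and the convention that sequence numbers run from 1 to N are assumptions made for this example; the application only requires a hash operation followed by a modulo operation.

```python
import hashlib

def bucket_index(row_id, n_buckets: int) -> int:
    """Map a numeric or string row identifier to a bucket sequence number in 1..n_buckets."""
    # Hash the textual form of the identifier so that both identifier types
    # follow the same path (assumption: UTF-8 text of the identifier is hashed).
    encoded = str(row_id).encode("utf-8")
    hash_value = int.from_bytes(hashlib.sha256(encoded).digest()[:8], "big")
    # The modulo result selects the bucket; +1 turns 0..N-1 into 1..N.
    return hash_value % n_buckets + 1

numeric_bucket = bucket_index(5, 4)
string_bucket = bucket_index("Zhang San", 4)
```

As long as the source side and the target side use the same function and the same N, a given row identifier always lands in the bucket with the same sequence number on both sides.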
The N first leaf nodes correspond to the N first hash buckets one to one, and it can be understood that an ith first leaf node (i.e., a first leaf node with sequence number i) in the N first leaf nodes corresponds to an ith first hash bucket (i.e., a first hash bucket with sequence number i) in the N first hash buckets. That is, the first leaf node with sequence number i corresponds to the first hash bucket with sequence number i. The sequence number of each first leaf node in the N first leaf nodes is different, the sequence number of each first hash bucket in the N first hash buckets is different, and i is a positive integer greater than or equal to 1 and less than or equal to N.
The N first hash buckets are obtained by performing hash partitioning on the first data table according to the row identifier, and the correspondence between the first hash bucket and the first data table is not specifically limited in the present application.
In one example, each first hash bucket is determined from a row in the first data table. At this time, each first hash bucket corresponds to one row of the first data table. In this case, each first hash bucket includes the same number of row identifications in the first data table.
In another example, at least one of the N first hash buckets is determined from two or more rows in the first data table. At this time, at least one first hash bucket corresponds to two or more lines of the first data table. In this case, the number of row identifications in the first data table included in each first hash bucket may not be the same.
Optionally, at least one of the N first hash buckets may be empty. That is, the N-1 first hash buckets are determined based on all of the rows of data included in the first data table, and the remaining one of the first hash buckets does not include any data in the first data table.
The two first hash buckets are different from each other, and it can be understood that the sequence numbers corresponding to the two first hash buckets are different from each other, and the row identifiers in the first data table included in the two first hash buckets are different from each other.
For example, hash bucket #1 has a sequence number of 1 and includes 2 row identifiers in the first data table, "5" and "6" respectively; hash bucket #2 has a sequence number of 2 and includes 1 row identifier in the first data table, "1". In this case, hash bucket #1 may be considered to be different from hash bucket #2.
In one example, the first data table may include M rows, M being a positive integer greater than or equal to N. In this case, the above processing on the first data table in the first database to generate the first Merkle tree may include the following steps:
performing hash processing on the M rows to obtain M first hash groups, wherein the M first hash groups correspond to the M rows one by one, each first hash group comprises a row identifier in the M rows and a hash value of row data corresponding to the row identifier, and the row identifiers of each first hash group are different;
mapping the M first hash groups to N first hash buckets;
determining hash values of N first leaf nodes according to N first hash buckets;
and generating a first Merkle tree according to the hash values of the N first leaf nodes.
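The first three steps above can be sketched as follows; the fourth step then combines the leaf hash values pairwise up to the root. The sketch assumes SHA-256 truncated to 8 bytes as the hash function, bucket selection by hashing the row identifier, and a leaf value of 0 for an empty bucket. The names h64 and leaf_hashes are hypothetical and not taken from the application.

```python
import hashlib

def h64(data: bytes) -> int:
    """Hypothetical helper: the first 8 bytes of SHA-256, as an integer."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def leaf_hashes(rows: dict, n_buckets: int) -> list:
    """rows maps row identifier -> row data; returns the N leaf hash values."""
    # Step 1: one hash group (row identifier, hash of row data) per row.
    hash_groups = [(row_id, h64(repr(data).encode("utf-8")))
                   for row_id, data in rows.items()]
    # Step 2: map each hash group to a bucket chosen from its row identifier.
    buckets = [[] for _ in range(n_buckets)]
    for row_id, row_hash in hash_groups:
        buckets[h64(str(row_id).encode("utf-8")) % n_buckets].append(row_hash)
    # Step 3: a leaf's hash is the XOR of the row hashes in its bucket
    # (assumption: 0 when the bucket is empty).
    leaves = []
    for bucket in buckets:
        value = 0
        for row_hash in bucket:
            value ^= row_hash
        leaves.append(value)
    return leaves

rows = {1: ("a", 10), 2: ("b", 20), 3: ("c", 30), 5: ("e", 50)}
leaves = leaf_hashes(rows, 4)
```

Because exclusive-or is commutative and associative, the leaf values do not depend on the order in which rows are read, so the rows do not need to be sorted before the tree is built.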
In the present application, the number of first hash groups included in each of the N first hash buckets is not particularly limited.
For example, the number of first hash groups included in each of the N first hash buckets may be the same. For example, the number of the first hash groups included in each of the N first hash buckets may be different. For example, a part of the N first hash buckets includes the same number of first hash groups, and the rest of the N first hash buckets includes different numbers of first hash groups.
In this application, the hash value of each first leaf node may be a hash value obtained by performing an exclusive or operation on hash values included in the first hash group included in the corresponding first hash bucket. It should be understood that when there is a first hash bucket corresponding to a first leaf node that does not include any of the M first hash groups, the hash value of that first leaf node may be null.
The determining the hash values of the N first leaf nodes according to the N first hash buckets may include:
when the first hash bucket corresponding to at least one first leaf node in the N first leaf nodes includes at least one first hash group in the M first hash groups, the hash value of at least one first leaf node is obtained by performing an exclusive or operation according to the hash value included in at least one first hash group in the M first hash groups included in the corresponding first hash bucket.
Optionally, the first hash bucket corresponding to the at least one first leaf node may further include two or more first hash groups in the M first hash groups.
When the first hash bucket corresponding to at least one first leaf node in the N first leaf nodes does not include one first hash group in the M first hash groups, the hash value of the at least one first leaf node is equal to zero.
The height of the first Merkle tree is associated with the first data table. Before the first Merkle tree is established, its parameters, such as the number of leaf nodes and the tree height, need to be determined according to the size of the first data table. The tree height of the first Merkle tree varies adaptively with the amount of data included in the first data table: the larger the amount of data, the higher the tree. In other words, the first Merkle tree for a first data table containing a larger amount of data (e.g., 1 GB) is higher than the first Merkle tree for a first data table containing a smaller amount of data (e.g., 100 MB).
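One possible way to derive these parameters is sketched below. The application does not state a sizing rule, so the target of rows_per_bucket rows per hash bucket, the rounding of the leaf count up to a power of two, and the function name tree_parameters are all assumptions made for illustration. The height is counted as the number of node layers, so 4 leaf nodes give a height of 3.

```python
def tree_parameters(row_count: int, rows_per_bucket: int = 1000):
    """Return (leaf_count, tree_height) for a table with row_count rows."""
    # Ceiling division gives the number of buckets needed; at least 2 leaves (N >= 2).
    needed = max(2, -(-row_count // rows_per_bucket))
    # Round up to the next power of two so that the binary tree is complete.
    leaf_count = 1 << (needed - 1).bit_length()
    # A complete binary tree with 2**k leaves has k + 1 layers of nodes.
    tree_height = leaf_count.bit_length()
    return leaf_count, tree_height

small = tree_parameters(10)
large = tree_parameters(1_000_000)
```

Under this assumed rule, a larger table yields more leaf nodes and therefore a higher tree, which matches the adaptive behaviour described above.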
Next, the above method for generating the first Merkle tree according to the hash values of the first leaf nodes is described by taking the Merkle tree shown in fig. 4 as an example.
As shown in fig. 4, the Merkle tree (i.e., an example of the first Merkle tree described above) has a tree height of 3, 2 intermediate nodes, and 4 leaf nodes. The topmost layer is the root node, the second layer contains the intermediate nodes, the third layer contains the leaf nodes, and the bottommost layer contains the hash buckets described above (i.e., an example of the first hash buckets described above).
For convenience of description, the 4 leaf nodes of the Merkle tree (i.e., an example of the first leaf nodes mentioned above) may be labeled from left to right as: leaf node 1, leaf node 2, leaf node 3, and leaf node 4. The 4 hash buckets of the Merkle tree may be labeled from left to right as: hash bucket 1, hash bucket 2, hash bucket 3, and hash bucket 4. The 4 leaf nodes of the Merkle tree correspond one to one to the 4 hash buckets. Specifically, leaf node 1 corresponds to hash bucket 1, leaf node 2 corresponds to hash bucket 2, leaf node 3 corresponds to hash bucket 3, and leaf node 4 corresponds to hash bucket 4.
The hash value of each leaf node of the Merkle tree is obtained by performing an exclusive-or operation on the hash values included in the hash bucket corresponding to that leaf node. For example, the hash value of leaf node 1 is obtained by performing an exclusive-or operation on the hash values included in hash bucket 1; that is, the hash value of leaf node 1 may be represented as N0 = XOR(1,5), where XOR(1,5) represents the result of performing an exclusive-or operation on the hash value of the row with row identifier 1 (i.e., 0xeffe898) and the hash value of the row with row identifier 5 (i.e., 0xb8b8dd) included in hash bucket 1.
The hash value of each intermediate node of the Merkle tree is obtained by performing a hash operation on the hash values of its two child nodes. For example, N4 = H(N0, N1) represents the hash value of one intermediate node of the Merkle tree, where H(N0, N1) represents the result of the hash operation performed on the hash values of the two child leaf nodes (i.e., N0 and N1) of that intermediate node.
The hash value of the root node of the Merkle tree is obtained by performing a hash operation on the hash values of its two child nodes. For example, H(N4, N5) represents the hash value of the root node of the Merkle tree.
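The bottom-up construction for a tree of the shape shown in fig. 4 can be sketched as follows. The hash operation H is assumed here to be SHA-256 over the concatenated 8-byte child values, truncated to 8 bytes; the leaf values are placeholders rather than the values in the figure, and the function name build_levels is hypothetical.

```python
import hashlib

def H(left: int, right: int) -> int:
    """Hash operation over two child hash values (assumed: truncated SHA-256)."""
    payload = left.to_bytes(8, "big") + right.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

def build_levels(leaves):
    """Return all levels, leaves first and root last; len(leaves) must be a power of two."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        # Each parent is the hash of its left and right child.
        levels.append([H(below[i], below[i + 1]) for i in range(0, len(below), 2)])
    return levels

# Placeholder leaf hash values standing in for N0..N3.
levels = build_levels([0x11, 0x22, 0x33, 0x44])
n4, n5 = levels[1]    # intermediate nodes: N4 = H(N0, N1), N5 = H(N2, N3)
root = levels[2][0]   # root node: H(N4, N5)
```

With 4 leaf nodes the list has three levels, matching the tree height of 3, and a change in any leaf value propagates to its parent and to the root.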
The ith leaf node #1 may be a leaf node #1 having a sequence number i, and the ith hash bucket #1 may be a hash bucket #1 having a sequence number i, where i is 1,2,3, or 4.
It should be understood that fig. 4 is illustrative only and does not constitute any limitation to the present application. For example, in some implementations, the Merkle tree shown in fig. 4 may also include a greater number of leaf nodes. For example, in some implementations, the Merkle tree shown in fig. 4 may also have a greater tree height.
Before step 110, the following operations may also be included: a first data table is obtained from a first database.
For convenience of description, the method for obtaining the first data table from the first database will be described in detail below with reference to fig. 5 and 6. It should be understood that fig. 5 and 6 are only schematic and do not limit the method for acquiring the first data table in the present application.
Fig. 5 is a schematic diagram of data extraction from a first data table provided in the present application. It should be understood that fig. 5 is an example only. For example, a greater number (e.g., 100 rows) or a lesser number (e.g., 4 rows) of row data may also be included in data table #1. For example, the data extraction module may also include a greater number of threads.
As shown in fig. 5, the execution subject of extracting data from the first data table may be a data extraction module. Specifically, the data extraction module may be the source data extraction module 131 and the target data extraction module 134 shown in fig. 2. That is, the source data extraction module 131 and the target data extraction module 134 in fig. 2 have the data extraction functions described below.
In one example, extracting data from the first data table may include, but is not limited to, the steps of:
dividing a first data table into S batches of data according to rows, wherein S is a positive integer greater than or equal to 1;
processing the S batches of data by using S processing threads, wherein the S processing threads correspond to the S batches of data one to one;
and enqueuing the processed S batches of data to the S queues, wherein the S batches of data correspond to the S queues one by one.
The number of the processing threads may be set according to the size of the first data table. For example, when the first data table is large, a larger number of processing threads may be set. For example, when the first data table is small, a smaller number of processing threads may be set.
In the above technical solution, the row data may be extracted from the first data table in batches, each batch of data may be processed by a separate thread, the threads may be executed in parallel, and the extracted data is placed in the corresponding data queue.
Alternatively, the same processing thread may be used to process the data in the data table.
In one example, the data extraction module may be the source data extraction module 131 in fig. 2 or the target data extraction module 134 in fig. 2. That is, the source data extraction module 131 and the target data extraction module 134 in fig. 2 have the functions of the data extraction module.
Referring to fig. 5, there are 8 pieces of data in the data table #1 (i.e., an example of the first data table); the 1st to 4th pieces of data may be used as the first batch of data, and the 5th to 8th pieces of data may be used as the second batch of data. The data extraction module may include two threads for extracting data, i.e., thread #1 and thread #2, where thread #1 may be responsible for extracting the first batch of data and thread #2 may be responsible for extracting the second batch of data. Thread #1 puts the extracted data into queue #1, and thread #2 puts the extracted data into queue #2. The data extraction threads, i.e., thread #1 and thread #2, may be executed in parallel to improve extraction efficiency.
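The batch extraction of fig. 5 can be sketched with Python's standard threading and queue modules as follows. The rows are represented here by an in-memory list rather than by a real database query, and the function names extract_in_batches and drain are hypothetical.

```python
import queue
import threading

def extract_in_batches(rows, batch_count):
    """Split rows into batch_count contiguous batches; one thread and one queue per batch."""
    size = -(-len(rows) // batch_count)  # ceiling division: rows per batch
    batches = [rows[i * size:(i + 1) * size] for i in range(batch_count)]
    queues = [queue.Queue() for _ in range(batch_count)]

    def worker(batch, out):
        # Each thread enqueues the rows of its own batch into its own queue.
        for row in batch:
            out.put(row)

    threads = [threading.Thread(target=worker, args=(b, q))
               for b, q in zip(batches, queues)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return queues

def drain(q):
    """Remove and return all items currently in the queue, in FIFO order."""
    items = []
    while not q.empty():
        items.append(q.get())
    return items

queue_1, queue_2 = extract_in_batches(list(range(1, 9)), 2)
first_batch, second_batch = drain(queue_1), drain(queue_2)
```

Giving each thread its own queue means the threads never contend for the same queue, and the order of rows within each batch is preserved.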
Fig. 6 is a schematic diagram of hash partitioning of data extracted from a first data table according to row identifiers according to the present application.
As shown in fig. 6, hash partitioning of the data extracted from data table #1 (i.e., an example of the first data table) by row identifier may be executed by a data processing module. Specifically, the data processing module may be the source processing module 132 or the target processing module 133 shown in fig. 2. That is, the source processing module 132 and the target processing module 133 in fig. 2 have the hash partitioning function described below. Referring to fig. 6, data table #1 includes 8 pieces of data, whose row identifiers are 1, 2, 3, ..., 8 respectively.
In this embodiment of the present application, the process of performing hash partitioning on data table #1 according to the row identifier and mapping the result of the hash partitioning to N (N = 4) hash buckets #1 may be as follows:
First, hash processing is performed on the row data of each row in data table #1 to obtain the hash value corresponding to the row data, and the hash values and their corresponding row identifiers are enqueued to hash data queue #1 in sequence. For convenience of description, each row in hash data queue #1 may be denoted as one hash group (i.e., an example of the first hash group described above). For example, the 1st hash group in hash data queue #1 has row identifier 1 and stores the hash value 0xffe898. The 5th hash group in hash data queue #1 has row identifier 5 and stores the hash value 0xb8bdd. See fig. 6 for the remaining hash groups, which are not listed here one by one.
Optionally, in some implementation manners, hash operation may be performed on the row identifier and the row data of each row of data to obtain a hash value corresponding to the row identifier and a hash value corresponding to the row data.
Then, a modulo operation is performed on the row identifiers in hash data queue #1, and the hash bucket corresponding to the row of each row identifier is determined according to the result of the modulo operation. Specifically, hash data queue #1 includes 8 hash groups. The result of row identifier 1 of the 1st hash group modulo 4 is 1, and the result of row identifier 5 of the 5th hash group modulo 4 is also 1, so the 1st hash group and the 5th hash group can be mapped to the 1st hash bucket #1 (i.e., the hash bucket with sequence number 1). That is, the row data with row identifiers 1 and 5 in data table #1 is mapped into the hash bucket #1 with sequence number 1. Similarly, the above process may be performed on the other hash groups in hash data queue #1, so that the 2nd hash group and the 6th hash group are mapped to the 2nd hash bucket #1, the 3rd hash group and the 7th hash group are mapped to the 3rd hash bucket #1, and the 4th hash group and the 8th hash group are mapped to the 4th hash bucket #1.
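The mapping of fig. 6 can be reproduced with the short sketch below. The text states that a remainder of 1 selects the bucket with sequence number 1; mapping a remainder of 0 to the bucket with sequence number N is inferred here from the stated result that the 4th and 8th hash groups land in the 4th bucket. The function name bucket_for is hypothetical.

```python
def bucket_for(row_id: int, n_buckets: int = 4) -> int:
    """Return the sequence number (1..n_buckets) of the bucket for a numeric row identifier."""
    remainder = row_id % n_buckets
    # Inferred convention: remainder 0 selects the last bucket.
    return remainder if remainder != 0 else n_buckets

# Distribute row identifiers 1..8 over 4 buckets.
buckets = {number: [] for number in range(1, 5)}
for row_id in range(1, 9):
    buckets[bucket_for(row_id)].append(row_id)
```

After the loop, bucket 1 holds row identifiers 1 and 5, bucket 2 holds 2 and 6, bucket 3 holds 3 and 7, and bucket 4 holds 4 and 8, as in the description of fig. 6.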
It should be understood that fig. 6 is an example only. For example, a greater number (e.g., 8) or a lesser number (e.g., 2) of hash buckets #1 may also be included. For example, a greater number (e.g., 100 rows) or a lesser number (e.g., 4 rows) of row data may also be included in data table #1.
Step 120, processing a second data table in a second database to generate a second Merkle tree.
The second database can be understood as a target-side database. For example, in one example, the second database may be the target-side database 120 shown in FIG. 1.
In this application, the second data table is obtained by synchronizing or migrating the first data table to the second database. The second Merkle tree includes N second leaf nodes, and the N second leaf nodes correspond one to one to N second hash buckets. The hash value of each second leaf node is determined according to the corresponding second hash bucket. The N second hash buckets are obtained by performing hash partitioning on the second data table according to the row identifier, using the same hash rule as that used to partition the first data table into the N first hash buckets, and any two second hash buckets are different.
In the present application, the second Merkle tree comprises N second leaf nodes, and the first Merkle tree comprises N first leaf nodes. Since the two Merkle trees comprise the same number of leaf nodes, the first Merkle tree and the second Merkle tree provided herein have the same tree height.
The N second leaf nodes correspond to the N second hash buckets one to one, and it can be understood that an ith second leaf node (i.e., the second leaf node with sequence number i) in the N second leaf nodes corresponds to an ith second hash bucket (i.e., the second hash bucket with sequence number i) in the N second hash buckets. That is, the second leaf node with sequence number i corresponds to the second hash bucket with sequence number i. The serial number of each second leaf node in the N second leaf nodes is different, the serial number of each second hash bucket in the N second hash buckets is different, and i is a positive integer greater than or equal to 1 and less than or equal to N.
The hash value of each second leaf node is determined according to the corresponding second hash bucket. The specific determination method may be the same as the method in step 110 for determining the hash value of a first leaf node according to the corresponding first hash bucket.
The N second hash buckets are obtained by performing hash partitioning on the second data table according to the row identifier, and the correspondence between the second hash buckets and the second data table is not specifically limited in the present application.
In one example, each second hash bucket is determined from a row in the second data table.
In another example, at least one of the N second hash buckets is determined from two or more rows in the second data table.
In yet another example, at least one of the N second hash buckets may also be empty.
The N second leaf nodes correspond to the N second hash buckets one to one, and it can be understood that an ith second leaf node in the N second leaf nodes corresponds to an ith hash bucket in the N second hash buckets, and i is a positive integer greater than or equal to 1 and less than or equal to N. It should also be understood that the sequence number of each of the N second leaf nodes is not the same.
The second data table is obtained by synchronizing or migrating the first data table to the second database, and may include the following cases:
if no data is missing during the process of synchronizing or migrating the first data table to the second database, the second data table includes the same number of rows as the first data table.
For example, the first data table includes 10 rows, each row including a row identifier and row data. If no data is missing while the first data table is synchronized or migrated to the second database, the second data table also includes 10 rows after synchronization or migration.
If data is missing during the process of synchronizing or migrating the first data table to the second database, the second data table includes fewer rows than the first data table.
For example, the first data table includes 10 rows, each row including a row identifier and row data. If 1 row is missing while the first data table is synchronized or migrated to the second database, the second data table includes 9 rows after synchronization or migration.
That is, the second data table in the present application may include the same number of rows as the first data table, or fewer rows than the first data table.
The hash rule for performing hash partitioning on the second data table according to the row identifier to obtain N second hash buckets is the same as the hash rule for performing hash partitioning on the first data table according to the row identifier to obtain N first hash buckets.
Any two second hash buckets are different, and it can be understood that the corresponding sequence numbers of any two second hash buckets are different, and the row identifiers included by any two non-empty second hash buckets are different.
In this application, when the second data table includes K rows and the first data table includes M rows, where K is a positive integer less than or equal to M, processing the second data table in the second database to generate the second Merkle tree may include:
performing hash processing on the K rows to obtain K second hash groups, wherein the K second hash groups correspond to the K rows one to one, each second hash group comprises the row identifier of one of the K rows and the hash value of the row data corresponding to that row identifier, and the row identifiers of the second hash groups are all different;
mapping the K second hash groups to the N second hash buckets;
determining hash values of the N second leaf nodes according to the N second hash buckets;
and generating the second Merkle tree according to the hash values of the N second leaf nodes.
When K is equal to M, the second data table includes the same number of rows as the first data table, that is, no data is missing during the process of synchronizing or migrating the first data table to the second database. When K is less than M, the second data table includes fewer rows than the first data table, that is, data is missing during the process of synchronizing or migrating the first data table to the second database.
It should be understood that the mapping rule for mapping the K second hash groups to the N second hash buckets is the same as the mapping rule for mapping the M first hash groups to the N first hash buckets, that is, the hash rule for hash partitioning the second data table according to the row identifier to obtain the N second hash buckets is the same as the hash rule for hash partitioning the first data table according to the row identifier to obtain the N first hash buckets.
For example, the first hash group with sequence number 1 includes row id 1 and the hash value of the corresponding row data, and the first hash group with sequence number 1 is mapped to the first hash bucket with sequence number 5. When the second data table includes a row with a row identifier of 1, the row with the row identifier of 1 corresponds to the second hash group with a sequence number of 1, and the second hash group with the sequence number of 1 is mapped to the second hash bucket with the sequence number of 5.
It should also be understood that when the second hash bucket corresponding to a second leaf node is empty, the hash value of that second leaf node is also empty.
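The four sub-steps of step 120 (hash the rows, map the hash groups to buckets, derive the leaf hashes, build the tree) can be sketched as below. This is an illustrative sketch under stated assumptions: SHA-256 is used because the text names no hash function, N is assumed to be a power of two, an empty leaf is represented by `None` and contributes an empty byte string to its parent's hash, and the names `row_hash`, `xor_bytes` and `build_merkle_tree` are hypothetical.

```python
import hashlib


def row_hash(row_data):
    """Hash one row's data (SHA-256 is an assumption, not taken from the text)."""
    return hashlib.sha256(row_data.encode("utf-8")).digest()


def xor_bytes(a, b):
    """Byte-wise exclusive-or of two equal-length digests."""
    return bytes(x ^ y for x, y in zip(a, b))


def build_merkle_tree(table, n_buckets):
    """Return (buckets, levels); levels[0] holds leaf hashes, levels[-1][0] is the root."""
    # Sub-step 1: one hash group (row identifier, hash of row data) per row.
    hash_groups = {row_id: row_hash(data) for row_id, data in table.items()}

    # Sub-step 2: map hash groups to buckets by row identifier.
    # List index 0 holds the bucket with sequence number 1.
    buckets = [{} for _ in range(n_buckets)]
    for row_id, digest in hash_groups.items():
        buckets[(row_id - 1) % n_buckets][row_id] = digest

    # Sub-step 3: leaf hash = XOR of the hash values in the bucket; empty bucket -> None.
    leaves = []
    for bucket in buckets:
        leaf = None
        for digest in bucket.values():
            leaf = digest if leaf is None else xor_bytes(leaf, digest)
        leaves.append(leaf)

    # Sub-step 4: hash pairs of children upward until a single root remains.
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        parents = []
        for i in range(0, len(prev), 2):
            left, right = prev[i] or b"", prev[i + 1] or b""
            parents.append(hashlib.sha256(left + right).digest())
        levels.append(parents)
    return buckets, levels


source = {i: f"row-{i}" for i in range(1, 9)}                  # M = 8 rows
target = {i: d for i, d in source.items() if i not in (2, 6)}  # K = 6 rows, rows 2 and 6 lost
_, source_levels = build_merkle_tree(source, 4)
_, target_levels = build_merkle_tree(target, 4)
```

In this sketch the target table has lost both rows that map to the 2nd bucket, so the 2nd leaf of the target tree is empty while the tree still has the same shape as the source tree.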
Before step 120, a second data table may be obtained from a second database.
The manner of obtaining the second data table is not particularly limited in the present application.
For example, while the second database copies data from the first database, when a check mark is detected in the second database, the data carrying the check mark is taken as the data of the second data table.
The content that is not described in detail in step 120 is the same as that described in step 110, and refer to step 110 for details, which are not described herein again.
Step 130, comparing the first merkel tree with the second merkel tree, and determining whether the first data table is consistent with the second data table.
The first data table being consistent with the second data table can be understood as follows: the first data table and the second data table store the same number of row identifiers, and the row data corresponding to the same row identifier is identical in both tables.
In this application, comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table may include:
determining the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;
if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determining that the first data table is consistent with the second data table;
and if the hash value of the root node of the first Merkle tree is not the same as the hash value of the root node of the second Merkle tree, determining that the first data table is inconsistent with the second data table.
In the above technical solution, the hash value of each non-leaf node of a Merkle tree is obtained by performing a hash operation on its child nodes; for example, the hash value of the root node is determined according to the two intermediate nodes below the root node. The hash value of each leaf node is determined according to the data in the hash bucket corresponding to that leaf node. Therefore, when the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, the first data table and the second data table can be considered consistent. When the hash value of the root node of the first Merkle tree is not the same as the hash value of the root node of the second Merkle tree, the first data table and the second data table can be considered different, i.e., inconsistent.
Optionally, after determining that the first data table is inconsistent with the second data table, the following operations may be further included:
determining that the hash value of the ith first leaf node is different from the hash value of the ith second leaf node, wherein the hash value of the ith first leaf node is determined according to the ith first hash bucket, the hash value of the ith second leaf node is determined according to the ith second hash bucket, i is a positive integer, and i is more than or equal to 1 and less than or equal to N;
comparing the ith first hash bucket with the ith second hash bucket, and determining the inconsistent row identifier of the first data table and the second data table;
and respectively inquiring the row data corresponding to the inconsistent row identifiers from the first database and the second database according to the inconsistent row identifiers.
For example, the hash value of the first leaf node with sequence number 1 is first compared with the hash value of the second leaf node with sequence number 1; if they are the same, the hash values of the leaf nodes with sequence number 2 are compared, and so on, until the hash value of the first leaf node with sequence number i is determined to be different from the hash value of the second leaf node with sequence number i. The first leaf node with sequence number i may be understood as the ith first leaf node, and the second leaf node with sequence number i may be understood as the ith second leaf node. As another example, the hash value of the first leaf node with sequence number N is first compared with the hash value of the second leaf node with sequence number N; if they are the same, the hash values of the leaf nodes with sequence number N-1 are compared, and so on, until the hash value of the first leaf node with sequence number i is determined to be different from the hash value of the second leaf node with sequence number i.
Comparing the ith first hash bucket with the ith second hash bucket to determine the inconsistent row identifier of the first data table and the second data table may include:
determining that a row identifier included in one hash group in the ith first hash bucket is the same as a row identifier included in one hash group in the ith second hash bucket, but the corresponding hash values are different;
and determining the row identifier as the row identifier of the inconsistency of the first data table and the second data table.
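The leaf-by-leaf comparison and the bucket comparison described above can be sketched as follows. The sketch makes several assumptions that are not in the text: hash values are toy 3-bit integers, every leaf pair is checked rather than stopping at the first mismatch, a row present in only one bucket is also reported (the listed steps only cover the case of equal row identifiers with different hash values), and the names `leaf_hash` and `find_inconsistent_rows` are hypothetical.

```python
from functools import reduce


def leaf_hash(bucket):
    """XOR of the hash values in a bucket; None stands for an empty bucket."""
    if not bucket:
        return None
    return reduce(lambda a, b: a ^ b, bucket.values())


def find_inconsistent_rows(buckets_1, buckets_2):
    """Return the row identifiers whose hash values differ between the two sides."""
    inconsistent = []
    for bucket_1, bucket_2 in zip(buckets_1, buckets_2):
        # Equal leaf hashes: the ith buckets are treated as consistent and skipped.
        if leaf_hash(bucket_1) == leaf_hash(bucket_2):
            continue
        # Leaf hashes differ: compare the hash groups of the two buckets row by row.
        for row_id in sorted(set(bucket_1) | set(bucket_2)):
            if bucket_1.get(row_id) != bucket_2.get(row_id):
                inconsistent.append(row_id)
    return inconsistent


# Each bucket maps row identifier -> hash value (toy 3-bit values).
buckets_1 = [
    {1: 0b011, 5: 0b101},
    {2: 0b010, 6: 0b110},
    {3: 0b001, 7: 0b111},
    {4: 0b100, 8: 0b000},
]
# Same data on the target side, except that row 6 has a different hash value.
buckets_2 = [dict(bucket) for bucket in buckets_1]
buckets_2[1][6] = 0b111

print(find_inconsistent_rows(buckets_1, buckets_2))
```

The returned row identifiers are what would then be used to query the row data from both databases.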
In the above technical solution, after the inconsistent row identifier is located, the inconsistent row data needs to be further located. In this embodiment, the process of locating the inconsistent row data may be as follows: first, the row identifier is extracted from the inconsistent data; then the corresponding row data is looked up in the source-side database and in the target-side database by that row identifier, where the row data includes the data of each row; finally, the inconsistent row data is found by direct comparison.
It is understood that the data consistency comparison performed by the data comparison module is a top-down comparison process. After specific inconsistent data is found, the data storage module stores the information into a nonvolatile storage medium for inquiry when needed.
By way of example, the process of comparing two Merkle trees for consistency according to the method provided in step 130 above is described below with reference to the two Merkle trees shown in fig. 7.
The two Merkle trees shown in fig. 7 may be obtained according to step 110 and step 120 above. For convenience of description, they are referred to as Merkle tree #1 (i.e., an example of the above first Merkle tree) and Merkle tree #2 (i.e., an example of the above second Merkle tree).
Hereinafter, the hash value of each node is described by taking Merkle tree #1 as an example; the hash value of each node of Merkle tree #2 is obtained in a similar way. The 4 leaf nodes #1 (i.e., an example of the first leaf nodes) of Merkle tree #1 correspond to the 4 hash buckets #1 (i.e., an example of the first hash buckets). From left to right, the 4 leaf nodes #1 are referred to as the 1st leaf node #1, the 2nd leaf node #1, the 3rd leaf node #1, and the 4th leaf node #1, and the 4 hash buckets #1 are referred to as the 1st hash bucket #1, the 2nd hash bucket #1, the 3rd hash bucket #1, and the 4th hash bucket #1. Each hash bucket #1 includes 2 hash groups, each hash group including a row identifier and a hash value. The hash value of the 1st leaf node #1 is obtained by performing an exclusive-or operation on the hash values included in the 2 hash groups in the 1st hash bucket #1, that is, the hash value of the 1st leaf node #1 is N0 = XOR(1,5), where XOR(1,5) denotes the exclusive-or of the hash value corresponding to row identifier 1 and the hash value corresponding to row identifier 5, that is, XOR(1,5) = XOR(011,101). The hash value of an intermediate node #1 is obtained by performing a hash operation on the hash values of its two child nodes. For example, N4 = H(N0, N1) represents the hash value of one intermediate node #1 of Merkle tree #1, where H(N0, N1) denotes the result of a hash operation on the hash values of the two leaf nodes #1 under that intermediate node #1 (i.e., N0 and N1). The hash value of the root node #1 is obtained by performing a hash operation on the hash values of its two child nodes. For example, H(N4, N5) represents the hash value of the root node #1 of Merkle tree #1.
Referring to fig. 7, the process of performing a consistency check on Merkle tree #1 and Merkle tree #2 may be as follows. First, the hash values of the root nodes are compared. Since the 4th hash bucket #1 of Merkle tree #1 is different from the 4th hash bucket #2 of Merkle tree #2 (i.e., the hash values corresponding to the hash group with row identifier "6" are different), the hash value of the root node #1 of Merkle tree #1 (i.e., H(N4, N5) in Merkle tree #1) is different from the hash value of the root node #2 of Merkle tree #2 (i.e., H(N4, N5) in Merkle tree #2). Next, the leaf nodes at which Merkle tree #1 and Merkle tree #2 differ are searched for from top to bottom. As can be seen from fig. 7, the 4th leaf node #1 is not identical to the 4th leaf node #2. The inconsistent row identifier is then determined to be "6" according to the hash buckets corresponding to the 4th leaf node #1 and the 4th leaf node #2.
The ith leaf node #1 may be a leaf node #1 having a sequence number i, and the ith hash bucket #1 may be a hash bucket #1 having a sequence number i, where i is 1,2,3, or 4.
It should be understood that fig. 7 is illustrative only and does not constitute any limitation to the present application. For example, hash bucket #1 and hash bucket #2 shown in fig. 7 may each include a different number of hash groups. For example, the hash values included in hash bucket #1 and hash bucket #2 shown in fig. 7 may also be hash values of a larger magnitude.
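The node hash values N0, N4 and H(N4, N5) described for Merkle tree #1 can be worked through numerically. In the sketch below only the values 011 and 101 for row identifiers 1 and 5 come from the text; the other 3-bit row hashes, their grouping into buckets, and the use of SHA-256 for H are assumptions made purely for illustration.

```python
import hashlib


def h(left, right):
    """H(left, right): hash of two child hashes (SHA-256 is an assumption)."""
    return hashlib.sha256(left + right).digest()


# Toy 3-bit row hashes keyed by row identifier.
# Only rows 1 and 5 (011 and 101) are taken from the text; the rest are made up.
row_hashes = {1: 0b011, 5: 0b101, 2: 0b010, 6: 0b110,
              3: 0b001, 7: 0b111, 4: 0b100, 8: 0b000}

# Leaf hashes: XOR of the two hash values in each bucket (bucket contents assumed).
n0 = row_hashes[1] ^ row_hashes[5]  # XOR(011, 101)
n1 = row_hashes[2] ^ row_hashes[6]
n2 = row_hashes[3] ^ row_hashes[7]
n3 = row_hashes[4] ^ row_hashes[8]

# Intermediate nodes and root: hash of the two child hashes.
n4 = h(bytes([n0]), bytes([n1]))  # N4 = H(N0, N1)
n5 = h(bytes([n2]), bytes([n3]))  # N5 = H(N2, N3)
root = h(n4, n5)                  # H(N4, N5)

print(format(n0, "03b"))  # prints 110
```

Changing any single row hash changes one leaf value, and therefore the intermediate node above it and the root, which is why comparing the two root hashes is sufficient as a first check.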
In the present application, the types of the first database and the second database involved in the above steps 110 to 130 are not particularly limited.
Optionally, the first database and the second database may be heterogeneous databases.
Optionally, the first database and the second database may be homogeneous databases.
Alternatively, the first database may be a relational database or a non-relational database, and the second database may be a relational database or a non-relational database.
For example, the first database may be a relational database and the second database may be a non-relational database. For example, the first database may be a relational database and the second database may be a relational database.
Relational databases generally store data in tabular form, so data can be extracted directly from a relational database to construct the first data table. Non-relational databases generally store data in non-tabular form (e.g., documents, key-value pairs, graph structures, etc.). Therefore, before data is extracted from a non-relational database to construct the first data table, the data to be checked in the non-relational database needs to be converted into a tabular form, wherein each row of the table may include a row identifier and row data.
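One possible way to convert document-style data into rows of the form (row identifier, row data) is sketched below. The text does not specify the conversion, so everything here is an assumption: the documents are modeled as Python dictionaries, the field used as the row identifier is passed in by the caller, the row data is serialized as JSON with sorted keys so that the same document yields the same string on both sides, and the name `documents_to_rows` is hypothetical.

```python
import json


def documents_to_rows(documents, id_field):
    """Convert documents into a table mapping row identifier -> serialized row data."""
    rows = {}
    for doc in documents:
        row_id = doc[id_field]
        payload = {key: value for key, value in doc.items() if key != id_field}
        # Sorted keys and fixed separators give a stable serialization,
        # independent of the order in which fields were stored.
        rows[row_id] = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return rows


docs = [
    {"id": 2, "name": "b", "tags": ["x"]},
    {"name": "a", "id": 1},
]
rows = documents_to_rows(docs, "id")
print(rows)
```

A stable serialization matters here because the row data is hashed afterwards; two stores that hold the same fields in a different order should still produce the same hash value.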
In most scenarios, data in the production database (i.e., an example of the first database) changes dynamically. If data that has already been verified changes again during the data verification process, it needs to be re-verified. If the original Merkle tree is updated in place, each leaf update triggers recalculation of the hash values of the layers above that leaf, and the calculation overhead increases sharply as the number of layers of the Merkle tree grows. Another factor to be considered is that when the incremental data grows rapidly, the original Merkle tree needs to be expanded and rebuilt to maintain overall verification performance, which affects the data that has already been verified.
The data verification method provided by the application can also verify the full data and the incremental data separately. Specifically, the full data is verified by using a Merkle tree with more layers, and the incremental data is verified by using a Merkle tree with fewer layers, so that the calculation overhead can be effectively reduced.
In the present application, the method for obtaining the incremental data to be verified from the data table is not particularly limited. For example, an existing method of obtaining incremental data to be verified in a data table may be employed. For example, other methods of obtaining incremental data to be verified may be used.
For example, the data table #1 (i.e., an example of the first data table) includes 10 rows of data, the data table #2 (i.e., an example of the second data table) includes 10 rows of data, and the data table #2 is obtained by copying the data table #1. At time #1, the method of steps 110 to 130 above may be used to generate Merkle tree #1 (i.e., an example of the above first Merkle tree) from the data table #1, generate Merkle tree #2 (i.e., an example of the above second Merkle tree) from the data table #2, and determine whether the data table #1 and the data table #2 are consistent by comparing Merkle tree #1 with Merkle tree #2. At a time after time #1, the data in the 5th to 10th rows of the data table #1 is updated; the updated data table #1 is referred to as the data table #3 (i.e., an example of the first data table), and the data table #4 (i.e., an example of the second data table) is obtained by copying the data table #3. In this case, the consistency check may be performed only on the incremental data by using the method of steps 110 to 130 above. Specifically, Merkle tree #3 (i.e., an example of the above first Merkle tree) is generated from the data table #3, Merkle tree #4 (i.e., an example of the above second Merkle tree) is generated from the data table #4, and whether the data table #3 and the data table #4 are consistent is determined by comparing Merkle tree #3 with Merkle tree #4.
The data verification method provided by the application can better meet the requirement of verifying massive data accurately and quickly in different scenarios (such as online data or offline data). In the above technical solution, the hash value of each first leaf node of the first Merkle tree corresponding to the first data table is determined according to the corresponding first hash bucket, and the hash value of each second leaf node of the second Merkle tree corresponding to the second data table is determined according to the corresponding second hash bucket. Since the hash rule for hash-partitioning the second data table according to the row identifier into the N second hash buckets is the same as the hash rule for hash-partitioning the first data table according to the row identifier into the N first hash buckets, data corresponding to the same row identifier in the second data table and the first data table is mapped to the second hash bucket and the first hash bucket with the same sequence number, so that the generated first Merkle tree and second Merkle tree have the same structure. Therefore, whether the first data table and the second data table are consistent can be determined directly by comparing the hash values of the root nodes of the two Merkle trees. In addition, since the hash values of the leaf nodes of the two Merkle trees are obtained by performing an exclusive-or operation on the data in the corresponding hash buckets, no data sorting is needed when the consistency check is performed on the first data table and the second data table.
When the hash values of the root nodes of the two Merkle trees are determined to be inconsistent, the data verification method provided by the application can further determine the inconsistent row identifiers and the corresponding row data by comparing the hash values of the leaf nodes of the two Merkle trees. To meet the data verification requirements of different scenarios, the data verification method provided by the application can also adaptively adjust the tree heights of the first Merkle tree and the second Merkle tree according to the size of the data set to be verified.
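The reason sorting can be avoided is that exclusive-or is commutative and associative, so a leaf hash computed by XOR does not depend on the order in which the rows of a bucket are read. A small sketch, assuming SHA-256 row hashes and the hypothetical helper names `row_hash`, `xor_leaf_hash` and `concat_leaf_hash`, contrasts this with an order-dependent alternative that hashes the concatenated row hashes:

```python
import hashlib
from functools import reduce


def row_hash(row_data):
    """Hash of one row as an integer (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(row_data.encode("utf-8")).digest(), "big")


def xor_leaf_hash(rows):
    """Leaf hash as the XOR of all row hashes in a non-empty bucket."""
    return reduce(lambda a, b: a ^ b, (row_hash(row) for row in rows))


def concat_leaf_hash(rows):
    """Order-dependent alternative: hash of the concatenated row hashes."""
    joined = b"".join(row_hash(row).to_bytes(32, "big") for row in rows)
    return hashlib.sha256(joined).hexdigest()


rows_source_order = ["row-1", "row-5", "row-9", "row-13"]
rows_target_order = list(reversed(rows_source_order))  # same rows, different read order

same_with_xor = xor_leaf_hash(rows_source_order) == xor_leaf_hash(rows_target_order)
same_with_concat = concat_leaf_hash(rows_source_order) == concat_leaf_hash(rows_target_order)
print(same_with_xor, same_with_concat)
```

With the concatenation variant, both sides would first have to sort the rows of each bucket into a common order; with XOR, the rows can be folded in as they are read.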
Next, a verification process of the data verification method 200 provided in the present application is specifically described with reference to fig. 8.
As shown in fig. 8, the method 200 includes steps 210 to 290, and the steps 210 to 290 are described below. The execution subject of steps 210 to 290 may be the data verification apparatus 130 shown in fig. 2.
Step 210, begin.
Step 210 above represents the beginning of the data consistency check.
In step 220, a check mark bit is added to the data to be copied in database #1 (i.e., an example of the first database in the method 100).
The method for adding the check mark bit to the data to be copied may be the same as existing methods, and is not described herein again.
In step 230, database #2 (i.e., an example of the second database in the method 100) copies the data to be copied.
In step 240, the flag bit is detected in database #1, and data table #1 is obtained.
The method for obtaining the data table #1 after the flag bit is detected in database #1 may be the same as existing methods, and is not described herein again.
In step 250, the flag bit is detected in database #2, and data table #2 is obtained.
Step 260, generating Merkle tree #1 (i.e., an example of the first Merkle tree in the method 100 described above) from the data table #1.
Step 270, generating Merkle tree #2 (i.e., an example of the second Merkle tree in the method 100 described above) from the data table #2.
The method for generating the Merkle trees in step 260 and step 270 is the same as the method for generating the Merkle trees in the method 100; refer to step 110 above for details, which are not described again here.
In step 280, it is determined whether the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2.
Determining whether the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2 includes:
in a case where it is determined that the hash value of the root node of Merkle tree #1 is the same as the hash value of the root node of Merkle tree #2, step 281 is performed;
in a case where it is determined that the hash value of the root node of Merkle tree #1 is not the same as the hash value of the root node of Merkle tree #2, steps 282 and 283 are performed.
In step 281, it is determined that data table #1 and data table #2 have consistency.
Step 282, comparing the hash values of the leaf nodes of Merkle tree #1 with the hash values of the leaf nodes of Merkle tree #2, and determining the inconsistent row identifiers.
See the method in step 130 above for details, which are not described in detail here.
In this application, the execution order of executing steps 281 and 282 is not particularly limited.
For example, after step 280 is performed, step 281 may be performed before step 282 is performed. For example, after step 280 is performed, step 282 may be performed before step 281 is performed.
Step 283, the inconsistent row data is determined from database #1 and database #2 based on the inconsistent row identification determined above.
See the method in step 130 above for details, which are not described in detail here.
Optionally, after step 283, the determined inconsistent row identifier and the corresponding row data may also be stored in a memory module of the data checking apparatus, such as the memory module 136 shown in fig. 2.
And step 290, ending.
Step 290 above represents the end of the data consistency check.
It should be understood that fig. 8 is only an illustration and does not limit the process of data verification provided in the present application. For example, in some implementations, steps 282 and 283 may not be performed after determining that data table #1 and data table #2 are inconsistent.
FIG. 9 is a schematic illustration of Merkle trees determined according to the method provided herein.
As shown in fig. 9, 4 Merkle trees are included: Merkle tree #3 (i.e., one example of the first Merkle tree), Merkle tree #4 (i.e., another example of the first Merkle tree), Merkle tree #5 (i.e., one example of the second Merkle tree), and Merkle tree #6 (i.e., another example of the second Merkle tree). The height of Merkle tree #3 and Merkle tree #5 is 3, and the height of Merkle tree #4 and Merkle tree #6 is 2. For a detailed description of Merkle tree #3, Merkle tree #4, Merkle tree #5 and Merkle tree #6, reference may be made to the description of fig. 7 above, which is not repeated here.
Merkle tree #3 can be understood as a Merkle tree generated at time #1 from the full data in the data table #1 (i.e., an example of the first data table). Merkle tree #4 can be understood as a Merkle tree generated at time #2 from the incremental data in the data table #1, where time #2 is a time after time #1, and the incremental data in the data table #1 at time #2 includes the rows of the data table #1 identified by the following row identifiers: "2", "6", "7" and "8". Merkle tree #5 can be understood as a Merkle tree generated at time #1 from the full data in the data table #2 (i.e., an example of the second data table). Merkle tree #6 can be understood as a Merkle tree generated at time #2 from the incremental data in the data table #2, where the incremental data in the data table #2 at time #2 includes the rows of the data table #2 identified by the following row identifiers: "2", "6", "7" and "8". The data table #2 is a data table obtained by migrating or synchronizing the data table #1.
For the method of generating Merkle tree #3, Merkle tree #4, Merkle tree #5, and Merkle tree #6, refer to the method 100; details are not described again here.
When the consistency check needs to be performed on the data table #1 and the data table #2, the method comprises the following steps:
whether the full data in the data table #1 and the data table #2 is consistent can be determined by comparing Merkle tree #3 with Merkle tree #5;
whether the incremental data in the data table #1 and the data table #2 is consistent can be determined by comparing Merkle tree #4 with Merkle tree #6.
Optionally, after it is determined that the full data in the data table #1 and the data table #2 is inconsistent, the inconsistent row identifier may be determined by comparing the leaf nodes of Merkle tree #3 with the leaf nodes of Merkle tree #5, and the row data corresponding to the inconsistent row identifier may be determined from the data table #1 and the data table #2 according to the inconsistent row identifier.
Optionally, after it is determined that the incremental data in the data table #1 and the data table #2 is inconsistent, the inconsistent row identifier may be determined by comparing the leaf nodes of Merkle tree #4 with the leaf nodes of Merkle tree #6, and the row data corresponding to the inconsistent row identifier may be determined from the data table #1 and the data table #2 according to the inconsistent row identifier.
In the embodiment of the application, the incremental data is verified by using separate, smaller Merkle trees, so that the calculation overhead can be further reduced and the efficiency of the data consistency check can be improved.
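The split between a taller tree for the full data and a smaller tree for the incremental data can be sketched as follows. The sketch rests on several assumptions: a tree of height 3 is taken to have 4 leaves and a tree of height 2 to have 2 leaves, SHA-256 is used as the hash function, an empty bucket is represented by an all-zero value, the simulated replication error on row 7 is made up, and the name `merkle_root` is hypothetical.

```python
import hashlib


def merkle_root(table, n_leaves):
    """Root hash over table (row identifier -> row data); n_leaves is a power of two."""
    leaves = [bytes(32) for _ in range(n_leaves)]  # all-zero value stands in for an empty bucket
    for row_id, data in table.items():
        digest = hashlib.sha256(data.encode("utf-8")).digest()
        idx = (row_id - 1) % n_leaves
        leaves[idx] = bytes(x ^ y for x, y in zip(leaves[idx], digest))
    level = leaves
    while len(level) > 1:
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


# Time #1: full data, compared with trees of height 3 (4 leaves).
table_1 = {i: f"v1-row-{i}" for i in range(1, 9)}
table_2 = dict(table_1)
full_ok = merkle_root(table_1, 4) == merkle_root(table_2, 4)

# Time #2: rows "2", "6", "7" and "8" are updated and replicated.
changed = (2, 6, 7, 8)
for i in changed:
    table_1[i] = f"v2-row-{i}"
    table_2[i] = f"v2-row-{i}"
table_2[7] = "v2-row-7-corrupted"  # simulated replication error (made up for the example)

# Only the incremental rows go into the smaller trees of height 2 (2 leaves).
inc_1 = {i: table_1[i] for i in changed}
inc_2 = {i: table_2[i] for i in changed}
inc_ok = merkle_root(inc_1, 2) == merkle_root(inc_2, 2)

print(full_ok, inc_ok)
```

In this sketch the second check hashes 4 rows and 3 tree nodes instead of rebuilding the 8-row tree, which is the saving the separate incremental tree is meant to provide.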
The data verification method provided by the present application, and a system architecture suitable for the method, etc. are described in detail above with reference to fig. 1 to 9. The data verification apparatus, the data verification device, and the data verification system provided in the present application are described in detail below with reference to fig. 10 to 12. It is to be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments, and therefore reference may be made to the preceding method embodiments for parts not described in detail.
In the embodiment of the present application, the data verification apparatus should include a processing unit and a determining unit. The data verification device may be the data verification device 130 described above.
Optionally, in some implementation manners, the data verification apparatus may further include a transceiver unit.
Next, referring to fig. 10, the data verification apparatus including the processing unit and the determining unit will be described as an example.
Fig. 10 is a schematic structural diagram of a data verification apparatus 1000 according to the present application.
As shown in fig. 10, the apparatus 1000 includes: a processing unit 1001 and a determination unit 1002.
A processing unit 1001, configured to process a first data table in a first database to generate a first Merkle tree, where each row of the first data table includes a row identifier and row data, the first Merkle tree includes N first leaf nodes, the N first leaf nodes correspond one-to-one to N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
the processing unit 1001 is further configured to process a second data table in a second database to generate a second Merkle tree, where the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree includes N second leaf nodes, the N second leaf nodes correspond one-to-one to N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to the row identifier, the hash rule used to partition the second data table into the N second hash buckets is the same as the hash rule used to partition the first data table into the N first hash buckets, and any two second hash buckets are different;
a determining unit 1002, configured to compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
Optionally, the first data table includes M rows, M is a positive integer greater than or equal to 1,
the processing unit 1001 is further configured to:
performing hash processing on the M rows to obtain M first hash groups, wherein the M first hash groups correspond one-to-one to the M rows, each first hash group comprises the row identifier of one of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers of any two first hash groups are different;
mapping the M first hash groups to the N first hash buckets;
the determining unit 1002 is further configured to:
determining hash values of the N first leaf nodes according to the N first hash buckets;
the processing unit 1001 is further configured to:
and generating the first Merkle tree according to the hash values of the N first leaf nodes.
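As a non-limiting illustration, the steps of forming the hash groups and mapping them to the hash buckets may be sketched as follows. The function names, the use of SHA-256, the `repr()` serialisation of row data, and the modulo-N hash rule are all assumptions of this sketch and are not specified by the application.

```python
import hashlib

def row_hash_group(row_id, row_data):
    """Return one hash group: (row identifier, hash value of the row data)."""
    # Assumption: repr() stands in for a canonical encoding that both
    # databases would have to share in a real deployment.
    digest = hashlib.sha256(repr(row_data).encode("utf-8")).digest()
    return row_id, digest

def bucket_index(row_id, n_buckets):
    """Assumed hash rule: SHA-256 of the row identifier, reduced modulo N."""
    h = hashlib.sha256(str(row_id).encode("utf-8")).digest()
    return int.from_bytes(h[:8], "big") % n_buckets

def map_to_buckets(rows, n_buckets):
    """Map M hash groups to N buckets; each bucket is {row_id: row_hash}."""
    buckets = [dict() for _ in range(n_buckets)]
    for row_id, row_data in rows:
        rid, digest = row_hash_group(row_id, row_data)
        buckets[bucket_index(rid, n_buckets)][rid] = digest
    return buckets

# Three example rows (row identifier, row data) partitioned into N = 4 buckets.
rows = [(1, ("alice", 30)), (2, ("bob", 25)), (3, ("carol", 41))]
buckets = map_to_buckets(rows, 4)
```

Because the bucket index depends only on the row identifier, applying the same rule to the second data table places each row in the bucket with the same index, regardless of the order in which the rows are read.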
Optionally, the hash value of each first leaf node is obtained by performing an exclusive OR (XOR) operation on the hash values in the first hash groups contained in the corresponding first hash bucket.
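The XOR combination may be sketched as follows, under the assumption that each bucket is held as a mapping from row identifier to a 32-byte SHA-256 hash of the row data. The names `xor_bytes` and `leaf_hash` are hypothetical.

```python
import hashlib
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def leaf_hash(bucket):
    """XOR together all row-data hashes held in one bucket."""
    zero = bytes(32)  # identity element for XOR over 32-byte hashes
    return reduce(xor_bytes, bucket.values(), zero)

def h(text):
    return hashlib.sha256(text.encode("utf-8")).digest()

# The same two hash groups inserted in a different order.
bucket_a = {1: h("alice"), 2: h("bob")}
bucket_b = {2: h("bob"), 1: h("alice")}
same = leaf_hash(bucket_a) == leaf_hash(bucket_b)
```

Since XOR is commutative and associative, the leaf hash does not depend on the order in which rows are read, which suits two databases that may return rows in different orders.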
Optionally, the determining unit 1002 is further configured to:
determining the hash value of the root node of the first Merkle tree and the hash value of the root node of the second Merkle tree;
if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree, determining that the first data table is consistent with the second data table;
and if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree, determining that the first data table is inconsistent with the second data table.
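A minimal sketch of the root comparison is given below. It assumes a binary tree in which each parent is the SHA-256 hash of its two concatenated children and in which an unpaired node is duplicated; the application does not prescribe these details, and `merkle_root` and `tables_consistent` are hypothetical names.

```python
import hashlib

def merkle_root(leaf_hashes):
    """Fold N >= 2 leaf hashes pairwise into a single root hash."""
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: pair the last node with a copy of itself.
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def tables_consistent(leaves_1, leaves_2):
    """The tables are treated as consistent when the two root hashes match."""
    return merkle_root(leaves_1) == merkle_root(leaves_2)

# Four example leaf hashes for each table; one leaf differs in the second.
leaves_1 = [hashlib.sha256(bytes([i])).digest() for i in range(4)]
leaves_2 = list(leaves_1)
leaves_2[2] = hashlib.sha256(b"changed").digest()
```

With this structure a single comparison of root hashes suffices when the tables agree, and the leaves only need to be inspected when the roots differ.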
Optionally, the determining unit 1002 is further configured to:
determining that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, wherein the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1 ≤ i ≤ N;
comparing the i-th first hash bucket with the i-th second hash bucket, and determining the row identifiers that are inconsistent between the first data table and the second data table;
the processing unit 1001 is further configured to:
and querying, from the first database and the second database respectively, the row data corresponding to the inconsistent row identifiers.
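The localisation of inconsistent rows may be sketched as follows, assuming each bucket is a mapping from row identifier to row-data hash. The function names are hypothetical, and list indices are 0-based here whereas the text counts i from 1. The final query against the two databases is omitted.

```python
def differing_leaves(leaves_1, leaves_2):
    """Indices of leaf nodes whose hash values differ between the two trees."""
    return [i for i, (a, b) in enumerate(zip(leaves_1, leaves_2)) if a != b]

def inconsistent_row_ids(bucket_1, bucket_2):
    """Row identifiers missing on one side or carrying different hashes."""
    ids = set(bucket_1) | set(bucket_2)
    return sorted(r for r in ids if bucket_1.get(r) != bucket_2.get(r))

# Example: the bucket with index 1 differs between the two tables.
leaves_1 = [b"L0", b"L1", b"L2"]
leaves_2 = [b"L0", b"XX", b"L2"]
bucket_1 = {1: b"a", 5: b"b", 9: b"c"}
bucket_2 = {1: b"a", 5: b"X", 13: b"d"}

bad_leaves = differing_leaves(leaves_1, leaves_2)
bad_rows = inconsistent_row_ids(bucket_1, bucket_2)
```

Only the buckets behind differing leaves need to be compared row by row, so the number of row-level comparisons is bounded by the size of those buckets rather than by the size of the whole table.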
Optionally, the first data table includes a full amount of data in at least one data table in the first database.
Optionally, the first data table includes incremental data in at least one data table in the first database.
Optionally, the height of the first Merkle tree is associated with the first data table.
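One possible way to relate the tree height to the data table is to derive the number of buckets N from the row count M. The sketch below is purely illustrative: the parameter `target_rows_per_bucket`, the rounding of N to a power of two, and both function names are assumptions and are not taken from the application.

```python
import math

def choose_bucket_count(m_rows, target_rows_per_bucket=1000):
    """Pick N >= 2 as a power of two so buckets hold roughly the target size."""
    needed = max(2, math.ceil(m_rows / target_rows_per_bucket))
    return 1 << (needed - 1).bit_length()  # round up to a power of two

def tree_height(n_leaves):
    """Number of levels (leaves and root included) of a full binary tree
    whose leaf count is a power of two."""
    return n_leaves.bit_length()

n = choose_bucket_count(1_000_000)
height = tree_height(n)
```

Under these assumptions a larger table yields more buckets and therefore a taller tree, while the average bucket size stays roughly constant.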
Optionally, the first database and the second database are heterogeneous databases or homogeneous databases.
Optionally, the first database is a relational database or a non-relational database, and the second database is a relational database or a non-relational database.
In the following, referring to fig. 11, the data verification apparatus including a transceiver, a processor and a memory is described as an example.
Fig. 11 is a schematic structural diagram of a data verification apparatus 1000 provided in the present application. As shown in fig. 11, the apparatus 1000 includes: a transceiver 1010, a processor 1020, and a memory 1030. The transceiver 1010, the processor 1020, and the memory 1030 communicate with one another through an internal connection path to transfer control and/or data signals; the memory 1030 is configured to store a computer program, and the processor 1020 is configured to call and run the computer program from the memory 1030 to control the transceiver 1010 to send and receive signals.
Specifically, the transceiver 1010 may be configured to obtain the first data table and the second data table, which is not described herein again.
Specifically, the functions of the processor 1020 correspond to the specific functions of the processing unit 1001 and the determining unit 1002 shown in fig. 10, and are not described herein again.
In the embodiment of the present application, the data verification device should include a processor. The data verification device may be any one of the terminal devices described above.
Optionally, in some implementations, the data verification device may further include a transceiver.
Optionally, in some implementations, the data verification device may further include a memory.
Fig. 12 is a schematic structural diagram of a system 1200 provided in the present application. As shown in fig. 12, the system 1200 includes: the data verification apparatus 1000 or the data verification device 1100 above. Optionally, the system 1200 may further include the above first database and the above second database.
The present application provides a computer program product, which when run on the data verification apparatus 1310, enables the data verification apparatus 1310 to execute the method 100 and/or the method 200 in the above method embodiments.
Those of ordinary skill in the art will appreciate that the various method steps and elements described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both, and that the steps and elements of the various embodiments have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the disclosed system, apparatus and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the unit is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer program instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, tapes), optical media (e.g., digital video discs (DVDs)), or semiconductor media (e.g., solid-state drives), among others.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
In addition, the term "and/or" in the present application merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects before and after it. The term "at least one" herein may mean "one" or "two or more". For example, at least one of A, B and C may mean: A exists alone, B exists alone, C exists alone, A and B exist together, A and C exist together, B and C exist together, or A, B and C all exist together, for a total of seven cases.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (24)

1. A method for data verification, the method comprising:
processing a first data table in a first database to generate a first Merkle tree, wherein each row of the first data table comprises a row identifier and row data, the first Merkle tree comprises N first leaf nodes, the N first leaf nodes correspond one-to-one to N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
processing a second data table in a second database to generate a second Merkle tree, wherein the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree comprises N second leaf nodes, the N second leaf nodes correspond one-to-one to N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to the row identifier, the hash rule used to partition the second data table into the N second hash buckets is the same as the hash rule used to partition the first data table into the N first hash buckets, and any two second hash buckets are different;
comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
2. The method of claim 1, wherein the first data table comprises M rows, M being a positive integer greater than or equal to 1, and wherein the processing a first data table in a first database to generate a first Merkle tree comprises:
performing hash processing on the M rows to obtain M first hash groups, wherein the M first hash groups correspond one-to-one to the M rows, each first hash group comprises the row identifier of one of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers of any two first hash groups are different;
mapping the M first hash groups to the N first hash buckets;
determining hash values of the N first leaf nodes according to the N first hash buckets;
and generating the first Merkle tree according to the hash values of the N first leaf nodes.
3. The method of claim 2, wherein
the hash value of each first leaf node is obtained by performing an exclusive OR (XOR) operation on the hash values in the first hash groups contained in the corresponding first hash bucket.
4. The method of any one of claims 1-3, wherein the comparing the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table comprises:
determining a hash value of the root node of the first Merkle tree and a hash value of the root node of the second Merkle tree;
determining that the first data table is consistent with the second data table if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree;
and determining that the first data table is inconsistent with the second data table if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree.
5. The method of claim 4, wherein after determining that the first data table is inconsistent with the second data table, the method further comprises:
determining that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, wherein the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1 ≤ i ≤ N;
comparing the i-th first hash bucket with the i-th second hash bucket, and determining the row identifiers that are inconsistent between the first data table and the second data table;
and querying, from the first database and the second database respectively, the row data corresponding to the inconsistent row identifiers.
6. The method of any of claims 1-5, wherein the first data table comprises a full amount of data in at least one data table in the first database.
7. The method of any of claims 1-5, wherein the first data table comprises incremental data in at least one data table in the first database.
8. The method of any one of claims 1-7, wherein the height of the first Merkle tree is associated with the first data table.
9. The method of any one of claims 1-8, wherein the first database and the second database are heterogeneous databases or homogeneous databases.
10. The method of any one of claims 1-9, wherein the first database is a relational or non-relational database and the second database is a relational or non-relational database.
11. A data verification apparatus, the apparatus comprising:
a processing unit, configured to process a first data table in a first database to generate a first Merkle tree, wherein each row of the first data table comprises a row identifier and row data, the first Merkle tree comprises N first leaf nodes, the N first leaf nodes correspond one-to-one to N first hash buckets, the hash value of each first leaf node is determined according to the corresponding first hash bucket, the N first hash buckets are obtained by hash partitioning the first data table according to the row identifier, any two first hash buckets are different, and N is a positive integer greater than or equal to 2;
the processing unit is further configured to process a second data table in a second database to generate a second Merkle tree, wherein the second data table is obtained by synchronizing or migrating the first data table to the second database, the second Merkle tree comprises N second leaf nodes, the N second leaf nodes correspond one-to-one to N second hash buckets, the hash value of each second leaf node is determined according to the corresponding second hash bucket, the N second hash buckets are obtained by hash partitioning the second data table according to the row identifier, the hash rule used to partition the second data table into the N second hash buckets is the same as the hash rule used to partition the first data table into the N first hash buckets, and any two second hash buckets are different;
a determining unit, configured to compare the first Merkle tree with the second Merkle tree to determine whether the first data table is consistent with the second data table.
12. The apparatus of claim 11, wherein the first data table comprises M rows, M being a positive integer greater than or equal to 1,
the processing unit is further configured to:
perform hash processing on the M rows to obtain M first hash groups, wherein the M first hash groups correspond one-to-one to the M rows, each first hash group comprises the row identifier of one of the M rows and the hash value of the row data corresponding to that row identifier, and the row identifiers of any two first hash groups are different;
map the M first hash groups to the N first hash buckets;
the determining unit is further configured to:
determine hash values of the N first leaf nodes according to the N first hash buckets;
the processing unit is further configured to:
generate the first Merkle tree according to the hash values of the N first leaf nodes.
13. The apparatus of claim 12, wherein
the hash value of each first leaf node is obtained by performing an exclusive OR (XOR) operation on the hash values in the first hash groups contained in the corresponding first hash bucket.
14. The apparatus according to any of claims 11-13, wherein the determining unit is further configured to:
determine a hash value of the root node of the first Merkle tree and a hash value of the root node of the second Merkle tree;
determine that the first data table is consistent with the second data table if the hash value of the root node of the first Merkle tree is the same as the hash value of the root node of the second Merkle tree;
and determine that the first data table is inconsistent with the second data table if the hash value of the root node of the first Merkle tree is different from the hash value of the root node of the second Merkle tree.
15. The apparatus of claim 14, wherein
the determining unit is further configured to:
determine that the hash value of the i-th first leaf node is different from the hash value of the i-th second leaf node, wherein the hash value of the i-th first leaf node is determined according to the i-th first hash bucket, the hash value of the i-th second leaf node is determined according to the i-th second hash bucket, i is a positive integer, and 1 ≤ i ≤ N;
compare the i-th first hash bucket with the i-th second hash bucket, and determine the row identifiers that are inconsistent between the first data table and the second data table;
the processing unit is further configured to:
query, from the first database and the second database respectively, the row data corresponding to the inconsistent row identifiers.
16. The apparatus of any of claims 11-15, wherein the first data table comprises a full amount of data in at least one data table in the first database.
17. The apparatus of any of claims 11-16, wherein the first data table comprises incremental data in at least one data table in the first database.
18. The apparatus of any one of claims 11-17, wherein the height of the first Merkle tree is associated with the first data table.
19. The apparatus of any of claims 11-18, wherein the first database and the second database are heterogeneous databases or homogeneous databases.
20. The apparatus of any of claims 11-19, wherein the first database is a relational or non-relational database and the second database is a relational or non-relational database.
21. A data verification device comprising at least one processor and a communications interface, the at least one processor being configured to execute a computer program or instructions to cause the performance of the data verification method as claimed in any one of claims 1 to 10.
22. The data verification device of claim 21, further comprising at least one memory coupled with the at least one processor, the computer program or instructions being stored in the at least one memory.
23. A computer-readable storage medium storing computer instructions which, when executed, implement the method of any one of claims 1 to 10.
24. A system comprising a data verification device as claimed in claim 21 or 22.
CN202011040390.7A 2020-09-28 2020-09-28 Data verification method, device and system Pending CN114281793A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011040390.7A CN114281793A (en) 2020-09-28 2020-09-28 Data verification method, device and system
PCT/CN2021/120282 WO2022063223A1 (en) 2020-09-28 2021-09-24 Data verification method, apparatus, and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011040390.7A CN114281793A (en) 2020-09-28 2020-09-28 Data verification method, device and system

Publications (1)

Publication Number Publication Date
CN114281793A true CN114281793A (en) 2022-04-05

Family

ID=80846243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011040390.7A Pending CN114281793A (en) 2020-09-28 2020-09-28 Data verification method, device and system

Country Status (2)

Country Link
CN (1) CN114281793A (en)
WO (1) WO2022063223A1 (en)

Also Published As

Publication number Publication date
WO2022063223A1 (en) 2022-03-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination