CN116150175A - Heterogeneous data source-oriented data consistency verification method and device - Google Patents


Info

Publication number
CN116150175A
Authority
CN
China
Prior art keywords
data
checksum
interval
database table
section
Prior art date
Legal status
Pending
Application number
CN202310410421.0A
Other languages
Chinese (zh)
Inventor
孙浩
李筱沛
Current Assignee
Accumulus Technologies Tianjin Co Ltd
Original Assignee
Accumulus Technologies Tianjin Co Ltd
Priority date
Filing date
Publication date
Application filed by Accumulus Technologies Tianjin Co Ltd
Priority to CN202310410421.0A
Publication of CN116150175A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/23 - Updating
    • G06F 16/2365 - Ensuring data consistency and integrity
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 - Tablespace storage structures; Management thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 - Querying
    • G06F 16/242 - Query formulation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data consistency verification method and device for heterogeneous data sources. The method includes: obtaining first data to be verified in a source database table from a source database, obtaining second data to be verified in a target database table from a target database, and calculating a first row count and a first checksum of the first data and a second row count and a second checksum of the second data; comparing the first row count with the second row count and the first checksum with the second checksum to obtain a comparison result; and determining, according to the comparison result, whether the first data and the second data pass the consistency check. By comparing both parameters, row count and checksum, of the first data in the source database table and the second data in the target database table, the accuracy of the data consistency check is improved, ensuring that source-end and target-end data remain consistent after data migration or data synchronization.

Description

Heterogeneous data source-oriented data consistency verification method and device
Technical Field
The embodiment of the invention relates to the technical field of databases, in particular to a data consistency verification method and device for heterogeneous data sources.
Background
Data consistency checking is a necessary function of any data synchronization or migration tool. Consistency checks can be divided into checks for homogeneous data sources and checks for heterogeneous data sources. For homogeneous data sources, the upstream and downstream table structures and SQL syntax rules are essentially the same, so industry open-source tools can be used for data consistency verification, such as pt-table-checksum for MySQL -> MySQL verification and sync-diff-inspector for MySQL -> TiDB verification. For heterogeneous data sources, the upstream and downstream table structures, SQL syntax rules, and the like differ; for example, when performing consistency verification of TiDB -> ClickHouse data synchronization, these verification tools cannot be used. If PingCAP's sync-diff-inspector is used for consistency verification of heterogeneous data sources, it exits directly because the table structures are inconsistent, and the data layer cannot be verified.
Therefore, for data consistency verification of heterogeneous data sources, the prior art still lacks a verification tool that can be used directly. How to check the data layer of heterogeneous data sources has become a technical problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a data consistency verification method and device for heterogeneous data sources, which solve the problem in the related art that verification tools such as sync-diff-inspector cannot verify the data layer of heterogeneous data sources.
In a first aspect, an embodiment of the present invention provides a data consistency verification method for heterogeneous data sources, where the method includes:
acquiring first data to be verified in a source database table from a source database, and acquiring second data to be verified in a target database table from a target database, wherein the first data and the second data have a corresponding relationship;
calculating a first row count and a first checksum of the first data, and a second row count and a second checksum of the second data, respectively;
comparing the first row count with the second row count, and comparing the first checksum with the second checksum, to obtain a comparison result;
and determining, according to the comparison result, whether the first data and the second data pass the consistency check.
Preferably, determining whether the first data and the second data pass the consistency check according to the comparison result includes:
if the comparison result is that the first row count is consistent with the second row count and the first checksum is consistent with the second checksum, determining that the first data and the second data pass the consistency check;
if the first row count is inconsistent with the second row count and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass the consistency check;
if the first row count is inconsistent with the second row count and the first checksum is consistent with the second checksum, determining that the first data and the second data do not pass the consistency check;
and if the first row count is consistent with the second row count and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass the consistency check.
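The four branches above reduce to a single rule: the check passes only when both parameters match. A minimal sketch of this decision logic (the function and parameter names are illustrative, not from the patent):

```python
def consistency_check(first_rows: int, first_checksum: int,
                      second_rows: int, second_checksum: int) -> bool:
    """Pass the consistency check only if both the row counts and the
    checksums of the source-side and target-side data match."""
    return first_rows == second_rows and first_checksum == second_checksum

# Any single mismatch fails the check:
assert consistency_check(100, 0xABCD, 100, 0xABCD) is True
assert consistency_check(100, 0xABCD, 99, 0xABCD) is False   # row counts differ
assert consistency_check(100, 0xABCD, 100, 0x1234) is False  # checksums differ
```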
Preferably, after comparing the first row count with the second row count and comparing the first checksum with the second checksum to obtain a comparison result, the method further includes:
generating a check report according to the comparison result; wherein the check report includes at least one of: the comparison result, the parameters of the source database table, the parameters of the target database table, the first row count, and the second row count.
Preferably, before obtaining the first data to be verified in the source database table from the source database and obtaining the second data to be verified in the target database table from the target database, the method further includes:
determining a verification mode, wherein the verification mode includes at least one of: a full verification mode, a full-data sampling verification mode, and a latest-data sampling verification mode;
and determining the first data and the second data according to the verification mode.
Preferably, determining the first data and the second data according to the verification mode includes:
when the verification mode is the full verification mode, determining that the first data is the full data of the source database table and that the second data is the full data of the target database table;
when the verification mode is the full-data sampling verification mode, determining that the first data is data randomly sampled from the full data of the source database table and that the second data is data randomly sampled from the full data of the target database table;
and when the verification mode is the latest-data sampling verification mode, determining that the first data is data randomly sampled from the latest data of the source database table and that the second data is data randomly sampled from the latest data of the target database table.
Preferably, the obtaining the first data to be verified in the source database table from the source database, and the obtaining the second data to be verified in the target database table from the target database includes:
querying, with a SELECT statement, the maximum and minimum primary key values of the full data of the source database table and of the full data of the target database table, respectively;
constructing a first primary key interval with the maximum and minimum primary key values of the full data of the source database table as endpoints, and constructing a second primary key interval with the maximum and minimum primary key values of the full data of the target database table as endpoints;
and segmenting the first primary key interval and the second primary key interval respectively, obtaining the first data based on the segmented first primary key interval, and obtaining the second data based on the segmented second primary key interval.
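The construction and segmentation of a primary key interval can be sketched as follows; the inclusive-endpoint convention and the `segment_len` parameter are illustrative assumptions (the patent only specifies a preset segment length value):

```python
def split_key_interval(key_min: int, key_max: int, segment_len: int):
    """Split the primary key interval [key_min, key_max] into
    consecutive segments of at most segment_len keys each."""
    segments = []
    lo = key_min
    while lo <= key_max:
        hi = min(lo + segment_len - 1, key_max)
        segments.append((lo, hi))  # inclusive endpoints
        lo = hi + 1
    return segments

# A key interval [1, 10] split with segment_len=4:
print(split_key_interval(1, 10, 4))  # [(1, 4), (5, 8), (9, 10)]
```

Note that the trailing segment may be shorter than the others when the interval length is not a multiple of `segment_len`; how the patent handles this remainder is not specified.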
Preferably, when the verification mode is the full verification mode, segmenting the first primary key interval and the second primary key interval respectively, obtaining the first data based on the segmented first primary key interval, and obtaining the second data based on the segmented second primary key interval includes:
determining a first preset number of interval segments from each of the first primary key interval and the second primary key interval according to a preset segment length value, wherein the interval segments are of equal length;
taking the first preset number of interval segments determined from the first primary key interval as the first data;
and taking the first preset number of interval segments determined from the second primary key interval as the second data;
calculating the first row count and the first checksum of the first data, and the second row count and the second checksum of the second data, respectively, includes:
calculating a first row count and a first checksum of the data of each interval segment in the first primary key interval, and a second row count and a second checksum of the data of each interval segment in the second primary key interval, respectively.
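A per-segment row count and checksum can be sketched in pure Python as below; the patent does not fix a checksum algorithm, so the CRC32-with-XOR aggregation here (similar in spirit to pt-table-checksum) is an assumption:

```python
import zlib

def segment_stats(rows, lo, hi):
    """Row count and checksum of the rows whose primary key lies in the
    interval segment [lo, hi]; rows are (pk, value_tuple) pairs."""
    count, checksum = 0, 0
    for pk, values in rows:
        if lo <= pk <= hi:
            count += 1
            # CRC32 over a '#'-joined row image, XOR-aggregated so the
            # result is independent of row order (an assumed scheme;
            # the patent does not specify the checksum function).
            checksum ^= zlib.crc32("#".join(map(str, (pk, *values))).encode())
    return count, checksum

rows = [(1, ("a",)), (2, ("b",)), (3, ("c",))]
count, checksum = segment_stats(rows, 1, 2)
assert count == 2
# The same rows in a different order yield the same checksum:
assert segment_stats(list(reversed(rows)), 1, 2) == (count, checksum)
```

Order-insensitive aggregation matters here because heterogeneous databases may return rows in different physical orders for the same key range.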
Preferably, when the verification mode is the full-data sampling verification mode, segmenting the first primary key interval and the second primary key interval respectively, obtaining the first data based on the segmented first primary key interval, and obtaining the second data based on the segmented second primary key interval includes:
sampling a second preset number of interval segments from each of the first primary key interval and the second primary key interval according to a preset segment length value and a preset sampling rate;
taking the second preset number of interval segments sampled from the first primary key interval as the first data;
and taking the second preset number of interval segments sampled from the second primary key interval as the second data;
calculating the first row count and the first checksum of the data in the source database table, and the second row count and the second checksum of the data in the target database table, respectively, includes:
calculating a first row count and a first checksum of the data of each interval segment in the first primary key interval, and a second row count and a second checksum of the data of each interval segment in the second primary key interval, respectively.
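The sampling-based selection of interval segments can be sketched as below; sharing a seed between the source and target sides, so that corresponding segments are compared, is an implementation assumption not stated in the patent:

```python
import random

def sample_segments(segments, sampling_rate, seed=None):
    """Randomly select a preset fraction of the interval segments for a
    sampling-based check; using the same seed on the source and target
    sides keeps the selected segments aligned between the two tables."""
    k = max(1, round(len(segments) * sampling_rate))
    return sorted(random.Random(seed).sample(segments, k))

segments = [(1, 100), (101, 200), (201, 300), (301, 400)]
picked = sample_segments(segments, 0.5, seed=7)
assert len(picked) == 2 and all(seg in segments for seg in picked)
# The same seed reproduces the same selection on the other side:
assert sample_segments(segments, 0.5, seed=7) == picked
```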
Preferably, when the verification mode is the latest-data sampling verification mode, segmenting the first primary key interval and the second primary key interval respectively, obtaining the first data based on the segmented first primary key interval, and obtaining the second data based on the segmented second primary key interval includes:
sampling a third preset number of interval segments from the latest data of the source database table and from the latest data of the target database table, respectively, according to a preset segment length value and a preset sampling rate;
taking the third preset number of interval segments sampled from the latest data of the source database table as the first data;
and taking the third preset number of interval segments sampled from the latest data of the target database table as the second data;
calculating the first row count and the first checksum of the data in the source database table, and the second row count and the second checksum of the data in the target database table, respectively, includes:
calculating a first row count and a first checksum of the data of each interval segment in the source database table, and a second row count and a second checksum of the data of each interval segment in the target database table, respectively.
Preferably, comparing the first row count with the second row count and comparing the first checksum with the second checksum to obtain a comparison result includes:
comparing the first row count of the data of the current first interval segment to be compared in the source database table with the second row count of the data of the corresponding second interval segment to be compared in the target database table to obtain a first comparison result; and comparing the first checksum of the data of the current first interval segment to be compared with the second checksum of the data of the second interval segment to be compared to obtain a second comparison result.
Preferably, after comparing the first row count of the data of the current first interval segment to be compared in the source database table with the second row count of the data of the corresponding second interval segment to be compared in the target database table to obtain the first comparison result, and comparing the first checksum of the data of the current first interval segment to be compared with the second checksum of the data of the second interval segment to be compared to obtain the second comparison result, the method further includes:
if at least one of the first comparison result and the second comparison result is inconsistent, comparing the data of the first interval segment to be compared with the data of the second interval segment to be compared row by row until the primary key IDs of the rows with inconsistent checksums are determined, and recording those primary key IDs;
and if both the first comparison result and the second comparison result are consistent, comparing the row count and checksum of the data of the next first interval segment to be compared in the source database table with those of the corresponding second interval segment to be compared in the target database table, until the comparison of all first interval segments to be compared with their corresponding second interval segments to be compared is completed.
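The row-by-row fallback described above can be sketched with per-row checksums keyed by primary key ID (the dict representation is illustrative):

```python
def diff_row_ids(source_rows, target_rows):
    """Given per-row checksums keyed by primary key ID, return the IDs
    whose checksums differ, including rows present on only one side."""
    mismatched = []
    for pk in sorted(set(source_rows) | set(target_rows)):
        if source_rows.get(pk) != target_rows.get(pk):
            mismatched.append(pk)
    return mismatched

source = {1: 0xAAAA, 2: 0xBBBB, 3: 0xCCCC}
target = {1: 0xAAAA, 2: 0xDEAD}  # pk 2 differs; pk 3 missing downstream
assert diff_row_ids(source, target) == [2, 3]
```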
In a second aspect, an embodiment of the present invention provides a data consistency checking device for heterogeneous data sources, where the device includes:
the acquisition module is used for acquiring first data to be checked in a source database table from a source database, and acquiring second data to be checked in a target database table from a target database, wherein the first data and the second data have a corresponding relation;
the computing module is used for calculating a first row count and a first checksum of the first data, and a second row count and a second checksum of the second data, respectively;
the comparison module is used for comparing the first row count with the second row count and comparing the first checksum with the second checksum to obtain a comparison result;
and the determining module is used for determining, according to the comparison result, whether the first data and the second data pass the consistency check.
Preferably, the determining module is further used for: if the comparison result is that the first row count is consistent with the second row count and the first checksum is consistent with the second checksum, determining that the first data and the second data pass the consistency check;
if the first row count is inconsistent with the second row count and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass the consistency check;
if the first row count is inconsistent with the second row count and the first checksum is consistent with the second checksum, determining that the first data and the second data do not pass the consistency check;
and if the first row count is consistent with the second row count and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass the consistency check.
The apparatus further comprises:
a generation module, used for generating a check report according to the comparison result after the first row count is compared with the second row count and the first checksum is compared with the second checksum to obtain the comparison result; wherein the check report includes at least one of: the comparison result, the parameters of the source database table, the parameters of the target database table, the first row count, and the second row count.
Preferably, the determining module is further used for determining a verification mode before the first data to be verified in the source database table is obtained from the source database and the second data to be verified in the target database table is obtained from the target database, wherein the verification mode includes at least one of: a full verification mode, a full-data sampling verification mode, and a latest-data sampling verification mode; and determining the first data and the second data according to the verification mode.
Preferably, the determining module is further used for determining, when the verification mode is the full verification mode, that the first data is the full data of the source database table and that the second data is the full data of the target database table;
when the verification mode is the full-data sampling verification mode, determining that the first data is data randomly sampled from the full data of the source database table and that the second data is data randomly sampled from the full data of the target database table;
and when the verification mode is the latest-data sampling verification mode, determining that the first data is data randomly sampled from the latest data of the source database table and that the second data is data randomly sampled from the latest data of the target database table.
Preferably, the acquisition module is further used for querying, with a SELECT statement, the maximum and minimum primary key values of the full data of the source database table and of the full data of the target database table, respectively;
constructing a first primary key interval with the maximum and minimum primary key values of the full data of the source database table as endpoints, and constructing a second primary key interval with the maximum and minimum primary key values of the full data of the target database table as endpoints;
and segmenting the first primary key interval and the second primary key interval respectively, obtaining the first data based on the segmented first primary key interval, and obtaining the second data based on the segmented second primary key interval.
Preferably, the acquisition module is further used for determining, when the verification mode is the full verification mode, a first preset number of interval segments from each of the first primary key interval and the second primary key interval according to a preset segment length value, wherein the interval segments are of equal length;
taking the first preset number of interval segments determined from the first primary key interval as the first data;
and taking the first preset number of interval segments determined from the second primary key interval as the second data;
calculating the first row count and the first checksum of the first data, and the second row count and the second checksum of the second data, respectively, includes:
calculating a first row count and a first checksum of the data of each interval segment in the first primary key interval, and a second row count and a second checksum of the data of each interval segment in the second primary key interval, respectively.
Preferably, the acquisition module is further used for sampling, when the verification mode is the full-data sampling verification mode, a second preset number of interval segments from each of the first primary key interval and the second primary key interval according to a preset segment length value and a preset sampling rate;
taking the second preset number of interval segments sampled from the first primary key interval as the first data;
and taking the second preset number of interval segments sampled from the second primary key interval as the second data;
calculating the first row count and the first checksum of the data in the source database table, and the second row count and the second checksum of the data in the target database table, respectively, includes:
calculating a first row count and a first checksum of the data of each interval segment in the first primary key interval, and a second row count and a second checksum of the data of each interval segment in the second primary key interval, respectively.
Preferably, the acquisition module is further used for sampling, when the verification mode is the latest-data sampling verification mode, a third preset number of interval segments from the latest data of the source database table and from the latest data of the target database table, respectively, according to a preset segment length value and a preset sampling rate;
taking the third preset number of interval segments sampled from the latest data of the source database table as the first data;
and taking the third preset number of interval segments sampled from the latest data of the target database table as the second data;
calculating the first row count and the first checksum of the data in the source database table, and the second row count and the second checksum of the data in the target database table, respectively, includes:
calculating a first row count and a first checksum of the data of each interval segment in the source database table, and a second row count and a second checksum of the data of each interval segment in the target database table, respectively.
Preferably, the comparison module is further used for comparing the first row count of the data of the current first interval segment to be compared in the source database table with the second row count of the data of the corresponding second interval segment to be compared in the target database table to obtain a first comparison result; and comparing the first checksum of the data of the current first interval segment to be compared with the second checksum of the data of the second interval segment to be compared to obtain a second comparison result.
Preferably, the comparison module is further used for, after the first comparison result and the second comparison result are obtained:
if at least one of the first comparison result and the second comparison result is inconsistent, comparing the data of the first interval segment to be compared with the data of the second interval segment to be compared row by row until the primary key IDs of the rows with inconsistent checksums are determined, and recording those primary key IDs;
and if both the first comparison result and the second comparison result are consistent, comparing the row count and checksum of the data of the next first interval segment to be compared in the source database table with those of the corresponding second interval segment to be compared in the target database table, until the comparison of all first interval segments to be compared with their corresponding second interval segments to be compared is completed.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor implements the steps of a heterogeneous data source oriented data consistency check method as described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer readable storage medium, where a computer program is stored, where the computer program, when executed by a processor, implements the steps of a data consistency checking method for heterogeneous data sources according to the first aspect.
In this way, the embodiment of the invention provides a data consistency verification method for heterogeneous data sources: after heterogeneous data sources are synchronized, a data-layer consistency check can be performed on them, and by comparing both parameters, row count and checksum, of the first data in the source database table and the second data in the target database table, the accuracy of the data consistency check is improved, ensuring that source-end and target-end data remain consistent after data migration or data synchronization.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a flowchart of a data consistency check method for heterogeneous data sources according to an embodiment of the present invention;
FIG. 2 is a flowchart of a data consistency check method for heterogeneous data sources according to an embodiment of the present invention;
FIG. 3 is a flowchart of a data consistency check method for heterogeneous data sources according to an embodiment of the present invention;
FIG. 4 is a technical architecture diagram of data consistency verification for heterogeneous data sources according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a data extraction stage according to an embodiment of the present invention;
FIG. 6 is a block diagram of a data consistency check device for heterogeneous data sources according to an embodiment of the present invention;
fig. 7 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In addition, existing verification tools can only check the full data of a table and do not support checking a sampled subset of the data; for an analytical table with a huge data volume, a full-data check makes script execution take too long, with high performance consumption and high cost.
Fig. 1 is a flowchart of a data consistency verification method for heterogeneous data sources according to an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step S101, first data to be verified in a source database table are obtained from a source database, and second data to be verified in a target database table are obtained from a target database;
wherein the first data and the second data have a corresponding relationship;
step S102, respectively calculating a first line number and a first checksum of the first data, and a second line number and a second checksum of the second data;
step S103, comparing the first line number with the second line number, and comparing the first checksum with the second checksum to obtain a comparison result;
step S104, determining whether the first data and the second data pass the consistency check according to the comparison result.
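The four steps above can be sketched as follows. This is a minimal illustration only, assuming each side's data is held as a list of rows and using CRC32 (the checksum function mentioned later in the description) summed across rows; all function names here are hypothetical.

```python
import zlib

def row_checksum(row):
    # Concatenate the row's fields with a separator and compute CRC32.
    joined = "|".join(str(v) for v in row)
    return zlib.crc32(joined.encode("utf-8"))

def table_checksum(rows):
    # Step S102 (checksum part): aggregate per-row checksums; summing keeps
    # the result insensitive to row order, truncated to 32 bits.
    return sum(row_checksum(r) for r in rows) & 0xFFFFFFFF

def consistency_check(first_data, second_data):
    # Step S102: compute row counts and checksums for both sides.
    first_rows, second_rows = len(first_data), len(second_data)
    first_sum, second_sum = table_checksum(first_data), table_checksum(second_data)
    # Steps S103/S104: the check passes only if both the row counts
    # and the checksums match.
    return first_rows == second_rows and first_sum == second_sum
```

As step S104 later makes explicit, any single mismatch (rows only, checksum only, or both) fails the check.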
It should be noted that, the heterogeneous data source according to the embodiment of the present invention includes: mySQL/TiDB- > MySQL/TiDB heterogeneous table check, mySQL/TiDB- > ClickHouse heterogeneous table check.
In one possible implementation manner, before the first data to be verified in the source database table is obtained from the source database and the second data to be verified in the target database table is obtained from the target database in step S101, the method further includes: determining a verification pattern, wherein the verification pattern comprises at least one of: a full-quantity check mode, a full-quantity data sampling check mode, and a latest data sampling check mode; and determining the first data and the second data according to the verification mode.
In step S101, DB connections of the source end and the target end first need to be created respectively, with the basic database configuration (such as passwords, user names, etc.); then an appropriate verification mode can be selected according to the actual needs of users, and the data to be compared is determined. After the data to be compared is determined, data extraction can be performed, that is, the first data to be checked is obtained from the source database table in the source database, and the second data to be checked is obtained from the target database table in the target database.
In one possible implementation, determining the first data and the second data according to the check pattern includes: when the verification mode is a full verification mode, determining that the first data is the full data of the source database table, and determining that the second data is the full data of the target database table; when the verification mode is a full data sampling verification mode, determining that the first data is random sampling data aiming at the full data of the source database table, and determining that the second data is random sampling data aiming at the full data of the target database table; when the check mode is the latest data sampling check mode, the first data is determined to be the random sampling data of the latest data aiming at the source database table, and the second data is determined to be the random sampling data of the latest data aiming at the target database table. Therefore, different verification modes can be set according to actual needs, and different data to be compared are selected, so that the diversity and flexibility of comparison are improved.
In one possible implementation manner, as shown in fig. 2, step S101, obtaining, from a source database, first data to be verified in a source database table, and obtaining, from a target database, second data to be verified in a target database table includes:
step S201, respectively inquiring the maximum value and the minimum value of the primary key of the full data of the source database table and the full data of the target database table by using the SELECT statement;
step S202, constructing a first main key section by taking the maximum value and the minimum value of the main keys of the full data of the source database table as endpoints, and constructing a second main key section by taking the maximum value and the minimum value of the main keys of the full data of the target database table as endpoints;
step S203, respectively carrying out segmentation processing on the first main key section and the second main key section, and acquiring first data based on the segmented first main key section; and acquiring second data based on the segmented second main key interval.
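Steps S201 to S203 can be sketched as follows, assuming an integer primary key and a SQLite connection standing in for the source or target database; the function names and the `chunk_size` parameter are illustrative, not part of the original disclosure.

```python
def primary_key_bounds(conn, table, pk="id"):
    # Step S201: query the minimum and maximum primary key values
    # with a SELECT statement.
    cur = conn.execute(f"SELECT MIN({pk}), MAX({pk}) FROM {table}")
    return cur.fetchone()

def split_interval(lo, hi, chunk_size):
    # Steps S202/S203: build the primary-key interval [lo, hi] with the
    # min/max values as endpoints, then cut it into fixed-length segments.
    chunks = []
    start = lo
    while start <= hi:
        end = min(start + chunk_size - 1, hi)
        chunks.append((start, end))
        start = end + 1
    return chunks
```

Each `(start, end)` pair is one interval section; the data of each section is then fetched and compared block by block.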
When the data of two tables is compared and found inconsistent, a row-by-row comparison is required. To improve comparison efficiency, in this possible implementation the table may be divided into a plurality of segments in index order: the block (segment) data is compared first, and only when a block comparison is inconsistent is a row-by-row comparison of the upstream and downstream data performed within that segment, which improves the efficiency of data comparison.
A description will now be given of how the data is segmented (into blocks) in the different check modes.
In one possible implementation manner, when the verification mode is a full-quantity verification mode, step S203 performs segmentation processing on the first primary key section and the second primary key section, and obtains first data based on the segmented first primary key section; based on the segmented second primary key interval, obtaining second data includes: determining a first preset number of interval sections from the first main key section and the second main key section respectively according to a preset segmentation length value, wherein the length of each interval section is equal; taking a first preset number of interval sections determined from the first main key interval as first data; and taking the first preset number of interval sections determined from the second main key interval as second data. And correspondingly, the step S102 of calculating the first number of lines and the first checksum of the first data, and the second number of lines and the second checksum of the second data respectively may include: respectively calculating a first line number and a first checksum of data of each interval section in a first main key interval; and a second number of rows and a second checksum of the data of each section in the second primary key section.
In one possible implementation manner, when the verification mode is a full data sampling verification mode, step S203 performs segmentation processing on the first primary key interval and the second primary key interval, and obtains first data based on the segmented first primary key interval; based on the segmented second primary key interval, obtaining second data includes: sampling and selecting a second preset number of interval sections from the first main key section and the second main key section respectively according to a preset segmentation length value and a preset sampling rate; sampling a second preset number of interval sections selected from the first main key interval as first data; and taking the section with the second preset number sampled from the second main key section as second data. And correspondingly, step S102, calculating a first number of rows and a first checksum of the data in the source database table, and a second number of rows and a second checksum of the data in the target database table respectively includes: respectively calculating a first line number and a first checksum of data of each interval section in a first main key interval; and a second number of rows and a second checksum of the data of each section in the second primary key section.
In one possible implementation manner, when the check mode is the latest data sampling check mode, step S203 performs segmentation processing on the first primary key interval and the second primary key interval, and obtains the first data based on the segmented first primary key interval; based on the segmented second primary key interval, obtaining second data includes:
Sampling and selecting a third preset number of interval sections from the latest data of the source database table and the latest data of the target database table respectively according to a preset segmentation length value and a preset sampling rate; sampling a third preset number of interval sections selected from the latest data of the source database table to serve as first data; and taking a third preset number of interval segments sampled and selected from the latest data of the target database table as second data. And correspondingly, respectively calculating a first line number and a first checksum of the data in the source database table and a second line number and a second checksum of the data in the target database table, wherein the first line number and the first checksum comprise: respectively calculating a first line number and a first checksum of data of each interval section in a source database table; and a second row number and a second checksum of the data of each interval section in the target database table.
That is, when the check mode is the full check mode, the full data may be segmented according to the preset segment length value, and when the check mode is the latest data sampling check mode or the full data sampling check mode, the sampling rate/sampling number needs to be set in advance, and then the full data or the latest data is segmented according to the set sampling rate/sampling number and the preset segment length value.
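The chunk selection for the two sampling modes described above can be sketched as follows. This is an assumed implementation: the chunk list comes from the segmentation step, the `sample_rate` parameter corresponds to the preset sampling rate, and the optional `seed` corresponds to the random-number-seed feature mentioned later in the description.

```python
import random

def sample_chunks(chunks, sample_rate, seed=None):
    # Randomly and uniformly select a subset of the candidate chunks
    # according to the sampling rate. For the full-data sampling mode the
    # candidates span the whole primary-key interval; for the latest-data
    # sampling mode they span only the preset number of latest rows.
    k = max(1, round(len(chunks) * sample_rate))
    rng = random.Random(seed)  # a fixed seed makes the sampling reproducible
    return sorted(rng.sample(chunks, k))
```

With the default (timestamp-based) seed each run samples different chunks; with a configured seed the same chunks are returned every run, which helps reproduce problems.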
In step S102, a first number of rows and a first checksum of the first data, and a second number of rows and a second checksum of the second data, may be calculated respectively. The first data is the data participating in the comparison from the source database, and the second data is the data participating in the comparison from the target database. A checksum, in the fields of data processing and data communications, is a value used to verify a set of data items at the destination; the data items may be numbers or other strings treated as numbers during the calculation. Because the recipient can recompute the checksum over the received data and compare it with the transmitted value, matching values indicate that the data arrived intact. Checksums, usually expressed in hexadecimal, are commonly used to ensure the integrity and accuracy of data in communications, especially long-distance communications. The embodiment of the invention considers both the checksum and the number of rows, which further improves the accuracy of the data consistency check.
In step S103, the first number of rows and the second number of rows may be compared, and the first checksum and the second checksum may be compared to obtain a comparison result;
In one possible implementation manner, step S103, comparing the first number of rows with the second number of rows, and comparing the first checksum with the second checksum to obtain a comparison result includes: comparing a first line number of the data of the current first interval section to be compared with a second line number of the data of the corresponding second interval section to be compared in the target database table to obtain a first comparison result; and comparing the first checksum of the data of the current first interval to be compared with the second checksum of the data of the second interval to be compared to obtain a second comparison result. Therefore, the table can be divided into a plurality of sections according to the index sequence, the comparison of block (section) data is firstly carried out, and when the comparison is inconsistent, the section is then subjected to the row-by-row comparison of upstream and downstream data, so that the data comparison efficiency can be improved.
In one possible implementation, as shown in fig. 3, the method further includes:
step S301, comparing a first line number of the data of the current first interval to be compared with a second line number of the data of the corresponding second interval to be compared in the target database table to obtain a first comparison result; comparing the first checksum of the data of the current first interval to be compared with the second checksum of the data of the second interval to be compared to obtain a second comparison result;
Step S302, if at least one of the first comparison result and the second comparison result indicates an inconsistency;
step S303, comparing the data of the first interval section to be compared with the data of the second interval section to be compared row by row until the row primary key IDs with inconsistent checksums are determined, and recording those row primary key IDs;
step S304, if the first comparison result and the second comparison result are both consistent, comparing the number of rows and the checksum of the data of the next first interval section to be compared with those of the corresponding second interval section to be compared in the target database table, until the comparison of all first interval sections to be compared with their corresponding second interval sections is completed.
That is, the number of rows and the checksum are calculated for the segmented data on the upstream and downstream sides, and the values are compared. If both the number of rows and the checksum are consistent, the data of the interval is considered consistent and the next interval is compared; if at least one of them is inconsistent, the data of the interval is considered inconsistent and the interval must be compared row by row, that is, the checksum of each row in the interval is calculated on the upstream and downstream sides respectively, until the row primary key IDs whose checksums differ are found. Finally, all row primary key IDs with inconsistent checksums can be collected and used as a reference for later data repair.
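The drill-down just described — block-level comparison first, row-by-row only on mismatch — can be sketched as follows, under the assumption that a chunk is represented as a dict mapping primary key ID to row value; the helper names are hypothetical.

```python
import zlib

def crc(value):
    # Per-row checksum; a missing row (None) hashes differently from any value.
    return zlib.crc32(str(value).encode("utf-8"))

def find_mismatched_ids(source_chunk, target_chunk):
    # Compare the chunk first by row count and aggregate checksum; only when
    # either differs fall back to a row-by-row comparison, recording the
    # primary key IDs whose checksums differ.
    src_sum = sum(crc(v) for v in source_chunk.values()) & 0xFFFFFFFF
    tgt_sum = sum(crc(v) for v in target_chunk.values()) & 0xFFFFFFFF
    if len(source_chunk) == len(target_chunk) and src_sum == tgt_sum:
        return []  # chunk is consistent; move on to the next chunk
    mismatched = []
    for pk in sorted(source_chunk.keys() | target_chunk.keys()):
        if crc(source_chunk.get(pk)) != crc(target_chunk.get(pk)):
            mismatched.append(pk)
    return mismatched
```

The returned IDs are exactly the "row primary key IDs with inconsistent checksums" that the description says are collected as a reference for later repair.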
In one possible implementation manner, after comparing the first number of rows with the second number of rows and comparing the first checksum with the second checksum to obtain the comparison result in step S103, the method further includes: generating a verification report according to the comparison result; wherein the verification report includes at least one of: the comparison result, the parameters of the source database table, the parameters of the target database table, the first number of rows, and the second number of rows. Thus the result of the data comparison can be clearly displayed, assisting the user in subsequent judgment and data repair.
In one possible implementation manner, step S104, according to the comparison result, determining whether the first data and the second data pass the consistency check includes: if the comparison result is that: the first line number is consistent with the second line number, and the first checksum is consistent with the second checksum, the first data and the second data are determined to pass the consistency check;
if the first line number is inconsistent with the second line number and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass the consistency check; if the first line number is inconsistent with the second line number and the first checksum is consistent with the second checksum, determining that the first data and the second data do not pass the consistency check; if the first line number is consistent with the second line number and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass the consistency check.
It will be appreciated that the first data and the second data pass the consistency check only if the first number of rows is consistent with the second number of rows and the first checksum is consistent with the second checksum.
Therefore, the embodiment of the invention provides a data consistency check method for heterogeneous data sources: after heterogeneous data sources are synchronized, a consistency check of the data layer can be performed on them, and by comparing the number of rows and the checksum of the first data and the second data in the source database table and the target database table respectively, the accuracy of the data consistency check can be further improved, thereby ensuring the consistency of the source-end and target-end data after data migration or data synchronization.
Fig. 4 shows a technical architecture diagram for data consistency verification for heterogeneous data sources according to an embodiment of the present invention, and based on the technical architecture diagram shown in fig. 4, the entire verification flow may be summarized as follows: a data extraction phase, a data consumption phase and a report generation phase.
The embodiment of the invention targets heterogeneous upstream and downstream data sources; the table structures are not compared, only the data layer is compared.
Three phases are now described:
Data extraction stage (as shown in fig. 5): data extraction refers to retrieving data from a data source and dividing it into a plurality of data blocks, the data source comprising at least one of: MySQL, TiDB, and ClickHouse. First, DB connections of the source end and the target end need to be created respectively, with the basic database configuration (such as passwords, user names, etc.). Then a SELECT statement can be executed to query the maximum and minimum primary key values of the upstream source table, and the data is segmented based on the interval formed by those values; after a series of primary-key interval chunks are obtained, the chunks are pushed into channels for consumption by the data comparison threads.
It should be noted that, in the embodiment of the present invention, for a ClickHouse table, a materialized view dedicated to verification is used; when the materialized view is created, the primary key is written into its ORDER BY clause. Since the base table data is synchronized to the materialized view essentially without delay, it is actually the materialized view that is compared for a ClickHouse table.
In addition, a full check mode and a sampling check mode can be configured. In the full check, the chunks to be verified can be obtained directly from the maximum and minimum primary key values and the configured chunksize value. In the sampling check, the primary key range needs to be sampled after the maximum and minimum primary key values are obtained. Sampling is divided into a full-data random sampling mode and a latest-data sampling mode, which differ only in the data sampling interval: the former randomly and uniformly selects chunks, according to the chunksize value and the sampling rate/sampling number, from all row data of the table to be checked, while the latter randomly and uniformly selects chunks, according to the chunksize value and the sampling rate/sampling number, from a specified (preset) number of the latest data. It should be noted that, in the full check mode, the full-data random sampling mode, and the latest-data sampling mode, the configured chunksize value and the number of chunks obtained may be equal or different, and can be set by the user.
It should be noted that, if the primary key IDs of the upstream and downstream data sources are not in a one-to-one correspondence, a mapping relationship between them (transmitted in field form) needs to be provided in the initial configuration stage of the database, so that the chunks built on the original primary keys can subsequently be converted, according to the mapping relationship, into chunks under that mapping; otherwise the data would not be comparable.
Data comparison stage:
In data comparison, the consumption threads calculate the number of rows and the checksum of the upstream and downstream chunk data respectively, and compare whether the values are consistent. If both the number of rows and the checksum are consistent, the chunk data is considered consistent and the next chunk is compared; if at least one is inconsistent, the chunk data is considered inconsistent and the chunk must be compared row by row, that is, the checksum of each row in the chunk is calculated on the upstream and downstream sides respectively, until the row primary key IDs whose checksums are not equal are found. Finally, all row primary key IDs with inconsistent checksums can be collected and used as a reference for later data repair.
It should be noted that the checksum may be calculated by an SQL function. The SQL function used may be CRC32 (cyclic redundancy check), which is about 3 times faster than the MD5 algorithm and less expensive.
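One possible shape of such a chunk checksum query is sketched below: per-row CRC32 over the concatenated columns, summed per primary-key interval, together with the row count. MySQL exposes CRC32() natively; SQLite (used here so the sketch is runnable) does not, so an equivalent user-defined function is registered first. The table and column names are illustrative.

```python
import sqlite3
import zlib

def checksum_query(conn, table, lo, hi, pk="id", cols=("id", "name")):
    # Register a CRC32 function equivalent to MySQL's built-in, then run
    # the same query shape a checking tool might issue per chunk:
    # row count plus the sum of per-row CRC32 checksums.
    conn.create_function("CRC32", 1, lambda s: zlib.crc32(str(s).encode()))
    concat = " || '|' || ".join(cols)  # explicit string splicing of columns
    sql = (f"SELECT COUNT(*), COALESCE(SUM(CRC32({concat})), 0) "
           f"FROM {table} WHERE {pk} BETWEEN ? AND ?")
    return conn.execute(sql, (lo, hi)).fetchone()
```

For ClickHouse the same idea applies, but as the description notes, non-string columns must be converted explicitly (toString/cast) before splicing.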
In addition, because ClickHouse has different characteristics, its SQL statements also differ; in particular, string-splicing functions are needed. Since ClickHouse is strongly typed (with types such as arrays, tuples, enumerations, and nested types) and does not support implicit conversions such as numbers to strings, toString/cast can be used to explicitly convert other types to strings.
When the data of two tables is identical, this can be determined by calculating the checksums of the two tables; when the data is not identical, a row-by-row comparison is required. To improve comparison efficiency and reduce the number of rows compared row by row when checksums are inconsistent, in this possible implementation the table may be divided into a plurality of blocks in index order: the checksum comparison of block data is performed first, and only when a block comparison is inconsistent is the row-by-row comparison of upstream and downstream data performed within that block, which improves the efficiency of data comparison.
Report generation phase:
After the data comparison, a check report is generated under the specified directory. The report records in detail the database table parameters and the verification result: it lists the names of tables whose data is consistent, the names of tables whose data is inconsistent together with the corresponding list of inconsistent primary key IDs, and the number of record rows verified as consistent. The inconsistent primary key IDs fall into three types: rows missing from the downstream table (add), rows redundant in the downstream table (delete), and rows inconsistent between the tables (update). The check report can serve as a basis for subsequent manual data repair; the user can inspect the upstream and downstream data and then choose whether to repair. Such a report clearly presents the result of the data comparison and assists the user in subsequent judgment and data repair.
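The three-way classification of inconsistent primary key IDs described above can be sketched as follows, again assuming chunks represented as dicts keyed by primary key; the function name is hypothetical.

```python
def classify_mismatches(source_chunk, target_chunk, differing_ids):
    # For each primary key found inconsistent, decide the repair type:
    # "add"    - the row is missing from the downstream table,
    # "delete" - the row is redundant in the downstream table,
    # "update" - both sides have the row but the values differ.
    result = {"add": [], "delete": [], "update": []}
    for pk in differing_ids:
        if pk not in target_chunk:
            result["add"].append(pk)
        elif pk not in source_chunk:
            result["delete"].append(pk)
        else:
            result["update"].append(pk)
    return result
```

The resulting lists are what the check report would enumerate per inconsistent table, as a basis for later manual repair.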
In addition, it should be noted that the embodiment of the present invention further has the following functions. In terms of the check parameters of the user configuration table: a single table or multiple tables can be configured; configuration or filtering of specified columns can be supported, i.e., a field black/white list can be set; a specified data range, i.e., a where condition, can be supported, and SQL functions can be entered in the where condition; a specified custom index can be supported (primary keys are used by default); both the full check and sampling check modes can be supported, with any sampling rate/sampling number settable in the sampling check mode, which offers both full-data sampling and latest-data sampling. In terms of task control parameters, the concurrency number and the number of blocking chunks can be set, and the verification speed can be significantly improved under high concurrency.
Automatic index checking of large tables can also be supported: the program checks, according to the table-creation statement, whether an index has been created on the fields in the where condition, and if not, an error is reported, preventing database downtime caused by improper SQL use. Whether the index check is triggered can be decided by setting a record-row threshold: when the number of data rows reaches the threshold, the table is treated as a large table by default and the index check is triggered, so as to avoid bringing down the database. Setting a random number seed can also be supported; its value is a timestamp. By default a task uses the current timestamp, so each sampling returns different data; once a specified seed is configured, the same sampled data is returned every time, which can be used for task testing and reproducing problems. That is, different timestamps return different data, so if some data is to be tested and reproduced, the timestamp corresponding to that data can be entered, which facilitates data testing and inspection.
The embodiment of the invention can provide SQL-based data comparison for heterogeneous upstream and downstream data sources, unify the full check and sampling check modes, support any sampling rate and various custom configuration and control parameters in the sampling check mode, and offer functions such as check-field selection and automatic index checking. By comparing the number of rows and the checksum of the first data and the second data in the source database table and the target database table respectively, the accuracy of the data consistency check can be further improved, thereby ensuring the consistency of the source-end and target-end data after data migration or data synchronization.
Fig. 6 is a block diagram of a data consistency check device 60 for heterogeneous data sources according to an embodiment of the present invention, where, as shown in fig. 6, the device 60 includes:
the obtaining module 601 is configured to obtain first data to be verified in a source database table from a source database, and obtain second data to be verified in a target database table from a target database, where the first data and the second data have a corresponding relationship;
a calculation module 602, configured to calculate a first number of lines and a first checksum of the first data, and a second number of lines and a second checksum of the second data, respectively;
The comparison module 603 is configured to compare the first number of rows with the second number of rows, and compare the first checksum with the second checksum to obtain a comparison result;
and the determining module 604 is configured to determine whether the first data and the second data pass the consistency check according to the comparison result.
In one possible implementation, the determining module 604 is further configured to, if the comparison result is: the first line number is consistent with the second line number, and the first checksum is consistent with the second checksum, the first data and the second data are determined to pass the consistency check;
if the first line number is inconsistent with the second line number and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass the consistency check;
if the first line number is inconsistent with the second line number and the first checksum is consistent with the second checksum, determining that the first data and the second data do not pass the consistency check;
if the first line number is consistent with the second line number and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass the consistency check.
The apparatus 60 further comprises:
the generation module is used for generating a verification report according to the comparison result after the first number of rows is compared with the second number of rows and the first checksum is compared with the second checksum; wherein the verification report includes at least one of: the comparison result, the parameters of the source database table, the parameters of the target database table, the first number of rows, and the second number of rows.
In a possible implementation manner, the determining module 604 is further configured to determine a verification mode before obtaining, from the source database, the first data to be verified in the source database table and obtaining, from the target database, the second data to be verified in the target database table, where the verification mode includes at least one of: a full-quantity check mode, a full-quantity data sampling check mode, and a latest data sampling check mode; and determining the first data and the second data according to the verification mode.
In a possible implementation manner, the determining module 604 is further configured to determine that the first data is the full data of the source database table and determine that the second data is the full data of the target database table when the check mode is the full check mode;
when the verification mode is a full data sampling verification mode, determining that the first data is random sampling data aiming at the full data of the source database table, and determining that the second data is random sampling data aiming at the full data of the target database table;
when the check mode is the latest data sampling check mode, the first data is determined to be the random sampling data of the latest data aiming at the source database table, and the second data is determined to be the random sampling data of the latest data aiming at the target database table.
In a possible implementation manner, the obtaining module 601 is further configured to query a maximum value and a minimum value of a primary key of the full data of the source database table and the full data of the target database table respectively using the SELECT statement;
constructing a first main key section by taking the maximum value and the minimum value of the main keys of the full data of the source database table as endpoints, and constructing a second main key section by taking the maximum value and the minimum value of the main keys of the full data of the target database table as endpoints;
respectively carrying out segmentation processing on the first main key section and the second main key section, and acquiring first data based on the segmented first main key section; and acquiring second data based on the segmented second main key interval.
In a possible implementation manner, the obtaining module 601 is further configured to determine, when the verification mode is a full-quantity verification mode, a first preset number of interval segments from the first primary key interval and the second primary key interval with a preset segment length value, where the length of each interval segment is equal;
taking a first preset number of interval sections determined from the first main key interval as first data;
taking a first preset number of interval sections determined from the second main key interval as second data;
Calculating a first number of rows and a first checksum of the first data, and a second number of rows and a second checksum of the second data, respectively, includes:
respectively calculating a first line number and a first checksum of data of each interval section in a first main key interval; and a second number of rows and a second checksum of the data of each section in the second primary key section.
In a possible implementation manner, the obtaining module 601 is further configured to, when the verification mode is the full data sampling verification mode, sample a second preset number of interval segments from the first primary key interval and the second primary key interval respectively according to a preset segment length value and a preset sampling rate;
take the second preset number of interval segments sampled from the first primary key interval as the first data;
take the second preset number of interval segments sampled from the second primary key interval as the second data.
Calculating the first number of rows and the first checksum of the data in the source database table, and the second number of rows and the second checksum of the data in the target database table, respectively, includes:
respectively calculating the first number of rows and the first checksum of the data of each interval segment in the first primary key interval, and the second number of rows and the second checksum of the data of each interval segment in the second primary key interval.
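The sampling step can be sketched as below. The patent only fixes that a preset segment length value and a preset sampling rate select a second preset number of segments; interpreting that number as `ceil(n * rate)` is an assumption, and `sample_segments` is a hypothetical helper name.

```python
import math
import random

def sample_segments(segments, rate, seed=None):
    """Randomly select a subset of interval segments at the given
    sampling rate; here the 'preset number' is ceil(n * rate)."""
    rng = random.Random(seed)  # seeded for reproducible sampling runs
    k = min(len(segments), max(1, math.ceil(len(segments) * rate)))
    return sorted(rng.sample(segments, k))
```

The same seed would be used for the source and target tables so that corresponding interval segments are selected on both sides.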
In a possible implementation manner, the obtaining module 601 is further configured to, when the verification mode is the latest data sampling verification mode, sample a third preset number of interval segments from the latest data of the source database table and the latest data of the target database table respectively according to a preset segment length value and a preset sampling rate;
take the third preset number of interval segments sampled from the latest data of the source database table as the first data;
take the third preset number of interval segments sampled from the latest data of the target database table as the second data.
Calculating the first number of rows and the first checksum of the data in the source database table, and the second number of rows and the second checksum of the data in the target database table, respectively, includes:
respectively calculating the first number of rows and the first checksum of the data of each interval segment in the source database table, and the second number of rows and the second checksum of the data of each interval segment in the target database table.
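The patent does not fix how the "latest data" window is delimited; the sketch below assumes it is the trailing range of the largest primary keys, which is one natural reading (helper name hypothetical):

```python
def latest_segments(max_pk, window, seg_len):
    """Segments covering only the most recent `window` primary keys,
    i.e. the interval [max_pk - window + 1, max_pk] (assuming keys >= 1)."""
    start = max(max_pk - window + 1, 1)
    segments = []
    while start <= max_pk:
        end = min(start + seg_len - 1, max_pk)
        segments.append((start, end))
        start = end + 1
    return segments
```

The resulting segments would then be sampled at the preset sampling rate, as in the full data sampling verification mode.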
In one possible implementation manner, the comparison module 603 is further configured to compare the first number of rows of the data of the current first interval segment to be compared in the source database table with the second number of rows of the data of the corresponding second interval segment to be compared in the target database table, to obtain a first comparison result; and to compare the first checksum of the data of the current first interval segment to be compared with the second checksum of the data of the second interval segment to be compared, to obtain a second comparison result.
In one possible implementation manner, the comparison module 603 is further configured to compare the first number of rows of the data of the current first interval segment to be compared in the source database table with the second number of rows of the data of the corresponding second interval segment to be compared in the target database table, to obtain a first comparison result; and to compare the first checksum of the data of the current first interval segment to be compared with the second checksum of the data of the second interval segment to be compared, to obtain a second comparison result;
if at least one of the first comparison result and the second comparison result indicates an inconsistency, compare the data of the first interval segment to be compared and the data of the second interval segment to be compared row by row until the primary key IDs of the rows whose checksums are inconsistent are determined, and record those row primary key IDs;
and if both the first comparison result and the second comparison result indicate consistency, compare the number of rows and the checksum of the data of the next first interval segment to be compared with those of the corresponding second interval segment to be compared in the target database table, until the comparison of the data of all first interval segments to be compared with the data of the corresponding second interval segments to be compared is completed.
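The two-level comparison described above (count and checksum first, row-by-row diff only on a mismatch) can be sketched as follows, assuming each segment's rows are held as a `{primary_key: row}` mapping and using a CRC32-per-row checksum as one possible choice (all helper names are hypothetical):

```python
import zlib
from functools import reduce

def row_crc(row):
    """CRC32 of one row's column values, joined with a separator."""
    return zlib.crc32("\x1f".join(map(str, row)).encode("utf-8"))

def compare_segment(src, dst):
    """src and dst map row primary-key ID -> row tuple for one segment.
    Row count and XOR checksum are compared first; only on a mismatch
    does the row-by-row diff run, returning the inconsistent row IDs."""
    same_count = len(src) == len(dst)
    same_sum = (reduce(lambda a, r: a ^ row_crc(r), src.values(), 0)
                == reduce(lambda a, r: a ^ row_crc(r), dst.values(), 0))
    if same_count and same_sum:
        return []  # segment consistent; caller moves on to the next one
    return sorted(pk for pk in set(src) | set(dst)
                  if pk not in src or pk not in dst
                  or row_crc(src[pk]) != row_crc(dst[pk]))
```

The cheap aggregate check lets consistent segments pass with a single comparison, so the expensive per-row work is paid only for segments that actually differ.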
Therefore, after heterogeneous data sources are synchronized, consistency verification can be performed on them at the data layer. By respectively comparing the numbers of rows and the checksums of the first data in the source database table and the second data in the target database table, the accuracy of data consistency verification is further improved, which ensures that the data at the source end and the target end remain consistent after data migration or data synchronization.
The embodiment of the present invention further provides an electronic device 70, as shown in fig. 7, including: a processor 701, a memory 702, and a program stored in the memory 702 and capable of running on the processor 701; when the program is executed by the processor 701, the steps of the heterogeneous data source-oriented data consistency verification method in the above embodiment are implemented.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored; when executed by a processor, the computer program implements each process of the heterogeneous data source-oriented data consistency verification method embodiment shown above and can achieve the same technical effect, which is not repeated here to avoid repetition. The computer readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by hardware alone, although in many cases the former is the preferred embodiment. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to perform the method according to the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many variations may be made by those of ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, and all such variations fall within the protection of the present invention.

Claims (14)

1. A method for verifying data consistency for heterogeneous data sources, the method comprising:
acquiring first data to be verified in a source database table from a source database, and acquiring second data to be verified in a target database table from a target database, wherein the first data and the second data have a corresponding relationship;
respectively calculating a first number of rows and a first checksum of the first data, and a second number of rows and a second checksum of the second data;
comparing the first number of rows with the second number of rows, and comparing the first checksum with the second checksum, to obtain a comparison result;
and determining, according to the comparison result, whether the first data and the second data pass consistency verification.
2. The method of claim 1, wherein determining whether the first data and the second data pass consistency verification according to the comparison result comprises:
if the comparison result is that the first number of rows is consistent with the second number of rows and the first checksum is consistent with the second checksum, determining that the first data and the second data pass consistency verification;
if the first number of rows is inconsistent with the second number of rows and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass consistency verification;
if the first number of rows is inconsistent with the second number of rows and the first checksum is consistent with the second checksum, determining that the first data and the second data do not pass consistency verification;
and if the first number of rows is consistent with the second number of rows and the first checksum is inconsistent with the second checksum, determining that the first data and the second data do not pass consistency verification.
3. The method of claim 1, wherein after comparing the first number of rows to the second number of rows and comparing the first checksum to the second checksum to obtain a comparison result, the method further comprises:
generating a verification report according to the comparison result; wherein the verification report includes at least one of: the comparison result, parameters of the source database table, parameters of the target database table, the first number of rows, and the second number of rows.
4. The method of claim 1, wherein the method further comprises, prior to obtaining the first data to be verified in the source database table from the source database and the second data to be verified in the target database table from the target database:
determining a verification mode, wherein the verification mode comprises at least one of: a full verification mode, a full data sampling verification mode, and a latest data sampling verification mode;
and determining the first data and the second data according to the verification mode.
5. The method of claim 4, wherein determining the first data and the second data according to the verification mode comprises:
when the verification mode is the full verification mode, determining that the first data is the full data of the source database table, and determining that the second data is the full data of the target database table;
when the verification mode is the full data sampling verification mode, determining that the first data is random sampling data of the full data of the source database table, and determining that the second data is random sampling data of the full data of the target database table;
and when the verification mode is the latest data sampling verification mode, determining that the first data is random sampling data of the latest data of the source database table, and determining that the second data is random sampling data of the latest data of the target database table.
6. The method of claim 5, wherein obtaining first data to be verified in a source database table from a source database and obtaining second data to be verified in a target database table from a target database comprises:
respectively querying, using a SELECT statement, the maximum value and the minimum value of the primary key of the full data of the source database table and of the full data of the target database table;
constructing a first primary key interval with the maximum and minimum primary key values of the full data of the source database table as endpoints, and constructing a second primary key interval with the maximum and minimum primary key values of the full data of the target database table as endpoints;
and segmenting the first primary key interval and the second primary key interval respectively, acquiring the first data based on the segmented first primary key interval, and acquiring the second data based on the segmented second primary key interval.
7. The method of claim 6, wherein, when the verification mode is the full verification mode, segmenting the first primary key interval and the second primary key interval respectively, acquiring the first data based on the segmented first primary key interval, and acquiring the second data based on the segmented second primary key interval comprises:
determining a first preset number of interval segments from the first primary key interval and the second primary key interval respectively according to a preset segment length value, where the interval segments are of equal length;
taking the first preset number of interval segments determined from the first primary key interval as the first data;
taking the first preset number of interval segments determined from the second primary key interval as the second data;
and calculating the first number of rows and the first checksum of the first data, and the second number of rows and the second checksum of the second data, respectively, comprises:
respectively calculating the first number of rows and the first checksum of the data of each interval segment in the first primary key interval, and the second number of rows and the second checksum of the data of each interval segment in the second primary key interval.
8. The method of claim 6, wherein, when the verification mode is the full data sampling verification mode, segmenting the first primary key interval and the second primary key interval respectively, acquiring the first data based on the segmented first primary key interval, and acquiring the second data based on the segmented second primary key interval comprises:
sampling a second preset number of interval segments from the first primary key interval and the second primary key interval respectively according to a preset segment length value and a preset sampling rate;
taking the second preset number of interval segments sampled from the first primary key interval as the first data;
taking the second preset number of interval segments sampled from the second primary key interval as the second data;
and calculating the first number of rows and the first checksum of the data in the source database table, and the second number of rows and the second checksum of the data in the target database table, respectively, comprises:
respectively calculating the first number of rows and the first checksum of the data of each interval segment in the first primary key interval, and the second number of rows and the second checksum of the data of each interval segment in the second primary key interval.
9. The method of claim 6, wherein, when the verification mode is the latest data sampling verification mode, segmenting the first primary key interval and the second primary key interval respectively, acquiring the first data based on the segmented first primary key interval, and acquiring the second data based on the segmented second primary key interval comprises:
sampling a third preset number of interval segments from the latest data of the source database table and the latest data of the target database table respectively according to a preset segment length value and a preset sampling rate;
taking the third preset number of interval segments sampled from the latest data of the source database table as the first data;
taking the third preset number of interval segments sampled from the latest data of the target database table as the second data;
and calculating the first number of rows and the first checksum of the data in the source database table, and the second number of rows and the second checksum of the data in the target database table, respectively, comprises:
respectively calculating the first number of rows and the first checksum of the data of each interval segment in the source database table, and the second number of rows and the second checksum of the data of each interval segment in the target database table.
10. The method of any of claims 7-9, wherein comparing the first number of rows to the second number of rows and comparing the first checksum to the second checksum to obtain a comparison result comprises:
comparing, in the source database table, the first number of rows of the data of the current first interval segment to be compared with the second number of rows of the data of the corresponding second interval segment to be compared in the target database table, to obtain a first comparison result; and comparing the first checksum of the data of the current first interval segment to be compared with the second checksum of the data of the second interval segment to be compared, to obtain a second comparison result.
11. The method of claim 10, wherein after comparing, in the source database table, the first number of rows of the data of the current first interval segment to be compared with the second number of rows of the data of the corresponding second interval segment to be compared in the target database table to obtain the first comparison result, and comparing the first checksum of the data of the current first interval segment to be compared with the second checksum of the data of the second interval segment to be compared to obtain the second comparison result, the method further comprises:
if at least one of the first comparison result and the second comparison result indicates an inconsistency, comparing the data of the first interval segment to be compared and the data of the second interval segment to be compared row by row until the primary key IDs of the rows whose checksums are inconsistent are determined, and recording those row primary key IDs;
and if both the first comparison result and the second comparison result indicate consistency, comparing the number of rows and the checksum of the data of the next first interval segment to be compared with those of the corresponding second interval segment to be compared in the target database table, until the comparison of the data of all first interval segments to be compared with the data of the corresponding second interval segments to be compared is completed.
12. A data consistency check device for heterogeneous data sources, the device comprising:
the acquisition module is used for acquiring first data to be checked in a source database table from a source database, and acquiring second data to be checked in a target database table from a target database, wherein the first data and the second data have a corresponding relation;
the computing module is used for respectively calculating a first number of rows and a first checksum of the first data, and a second number of rows and a second checksum of the second data;
the comparison module is used for comparing the first number of rows with the second number of rows, and comparing the first checksum with the second checksum, to obtain a comparison result;
and the determining module is used for determining whether the first data and the second data pass the consistency check according to the comparison result.
13. An electronic device, comprising: a processor, a memory, and a program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the heterogeneous data source-oriented data consistency verification method according to any one of claims 1 to 11.
14. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and the computer program, when executed by a processor, implements the steps of the heterogeneous data source-oriented data consistency verification method according to any one of claims 1 to 11.
CN202310410421.0A 2023-04-18 2023-04-18 Heterogeneous data source-oriented data consistency verification method and device Pending CN116150175A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310410421.0A CN116150175A (en) 2023-04-18 2023-04-18 Heterogeneous data source-oriented data consistency verification method and device


Publications (1)

Publication Number Publication Date
CN116150175A true CN116150175A (en) 2023-05-23

Family

ID=86352714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310410421.0A Pending CN116150175A (en) 2023-04-18 2023-04-18 Heterogeneous data source-oriented data consistency verification method and device

Country Status (1)

Country Link
CN (1) CN116150175A (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989044A (en) * 2015-02-04 2016-10-05 阿里巴巴集团控股有限公司 Database verification method and system
CN106326222A (en) * 2015-06-16 2017-01-11 阿里巴巴集团控股有限公司 Data processing method and device
CN107798007A (en) * 2016-08-31 2018-03-13 南京中兴新软件有限责任公司 A kind of method, apparatus and relevant apparatus of distributed data base data check
CN107807982A (en) * 2017-10-27 2018-03-16 中国农业银行股份有限公司 A kind of consistency desired result method and device of heterogeneous database
CN111427875A (en) * 2020-03-19 2020-07-17 广东蔚海数问大数据科技有限公司 Sampling method, system and storage medium for data quality detection
CN113626416A (en) * 2020-05-07 2021-11-09 华为技术有限公司 Data verification method and device, computing equipment and storage medium
CN113779150A (en) * 2021-09-14 2021-12-10 杭州数梦工场科技有限公司 Data quality evaluation method and device
CN113868141A (en) * 2021-09-29 2021-12-31 北京达佳互联信息技术有限公司 Data testing method and device, electronic equipment and storage medium
CN113961590A (en) * 2021-10-13 2022-01-21 安天科技集团股份有限公司 Multi-source data fusion method and device and electronic equipment
CN114153809A (en) * 2021-10-20 2022-03-08 贵州数联铭品科技有限公司 Parallel real-time incremental statistic method based on database logs
CN114676126A (en) * 2022-05-30 2022-06-28 深圳钛铂数据有限公司 Database-based data verification method, device, equipment and storage medium
CN115237706A (en) * 2022-06-20 2022-10-25 平安科技(深圳)有限公司 Buried point data processing method and device, electronic equipment and storage medium
CN115422180A (en) * 2022-07-21 2022-12-02 中银金融科技有限公司 Data verification method and system
CN115421965A (en) * 2022-09-14 2022-12-02 平凯星辰(北京)科技有限公司 Consistency checking method and device, electronic equipment and storage medium
CN115827708A (en) * 2022-12-09 2023-03-21 北京小米移动软件有限公司 Data sampling method and device, storage medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN107807982B (en) Consistency checking method and device for heterogeneous database
US9171025B2 (en) Test data generation and scale up for database testing using unique common factor sequencing
WO2021056590A1 (en) Method for calibrating simulation model of production line, and device
CN111563041B (en) Test case on-demand accurate execution method
CN112527894B (en) Database consistency checking method and system
CN112328499A (en) Test data generation method, device, equipment and medium
CN109783451A (en) File updating method, device, equipment and medium based on Message Digest 5
CN111290998A (en) Method, device and equipment for calibrating migration data and storage medium
CN111176991A (en) Automatic generation method for embedded software interface use case
CN112416907A (en) Database table data importing and exporting method, terminal equipment and storage medium
CN114138907A (en) Data processing method, computer device, storage medium, and computer program product
CN107798007B (en) Distributed database data verification method, device and related device
CN114691506A (en) Pressure testing method, apparatus, device, medium, and program product
US11409928B2 (en) Configurable digital twin
CN113190531A (en) Database migration method, device, equipment and storage medium
CN111611253B (en) Data verification method, device and storage medium
CN116150175A (en) Heterogeneous data source-oriented data consistency verification method and device
CN115421965A (en) Consistency checking method and device, electronic equipment and storage medium
CN113886221A (en) Test script generation method and device, storage medium and electronic equipment
CN112416417A (en) Code amount statistical method and device, electronic equipment and storage medium
CN113157551A (en) ROS-oriented differential fuzzy test method
CN113742208A (en) Software detection method, device, equipment and computer readable storage medium
CN116795723B (en) Chain unit test processing method and device and computer equipment
CN112433738B (en) Firmware update test method, system, equipment and medium
CN116955120A (en) Test script generation method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230523